# Exploratory Data Analysis

## Overview

To practice loading datasets, we're going to use the [Flags Dataset](https://archive.ics.uci.edu/ml/datasets/Flags) from UCI and load it two ways: via its URL and from a local file.

Steps for loading a dataset:

1) Learn as much as you can about the dataset:
 - Number of rows
 - Number of columns
 - Column headers (Is there a "data dictionary"?)
 - Is there missing data?
 - **OPEN THE RAW FILE AND LOOK AT IT. IT MAY NOT BE FORMATTED IN THE WAY THAT YOU EXPECT.**

2) Try loading the dataset using `pandas.read_csv()`. If things aren't acting the way you expect, investigate until you can get it loading correctly.

3) Keep in mind that functions like `pandas.read_csv()` have a lot of optional parameters that change the way data is read in. If you get stuck, google, read the documentation, and try things out.

4) You might need to type out column headers by hand if they are not provided in a neat format in the original dataset. It can be a drag.

### Learn about the dataset and look at the raw file.

In [1]:
# Find the actual file to download
# From navigating the page, clicking "Data Folder"
# Right click on the link to the dataset and say "Copy Link Address"

flag_data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data'

# You can "shell out" in a notebook for more powerful tools
# https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html

# Funny extension, but on inspection looks like a csv

# Extensions are just a norm! You have to inspect to be sure what something is
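
One way to "look at the raw file" without leaving Python is to print its first few lines as plain text. This is only a sketch: the file name `sample.data` and its three made-up rows are hypothetical stand-ins, written to a temporary folder so the example runs on its own.

```python
import os
import tempfile
from itertools import islice

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file with an unusual extension and made-up rows
    path = os.path.join(tmp_dir, "sample.data")
    with open(path, "w") as f:
        f.write("Atlantis,1,red\nElbonia,2,green\nRuritania,3,blue\n")

    # Read only the first two lines as raw text, no parsing at all
    with open(path) as f:
        first_lines = [line.rstrip("\n") for line in islice(f, 2)]

print(first_lines)
```

Comma-separated values and no header line: that is the kind of thing you only learn by inspecting the file, whatever the extension says.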

### Attempt to load it via its URL

In [2]:
# Load the flags dataset from its URL:


In [3]:
# how does it look?

If things go wrong, investigate and try to figure out why.
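
A common thing to go wrong with a headerless file: by default `pandas.read_csv()` treats the first line as column names. The sketch below shows the symptom on a tiny made-up sample (the rows are hypothetical, not real flag data), read from a string with `io.StringIO` so no download is needed.

```python
import io

import pandas as pd

# Hypothetical stand-in for a headerless file: every line is a data row
raw_text = "Atlantis,1,red\nElbonia,2,green\nRuritania,3,blue\n"

# With the defaults, pandas uses the first line as the header
df_default = pd.read_csv(io.StringIO(raw_text))

print(df_default.shape)          # one row fewer than the file has lines
print(list(df_default.columns))  # the first data row, misread as headers
```

If the row count is off by one and the column names look like data, a missing header is a likely cause.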


#### Different ways to look at the documentation:
https://archive.ics.uci.edu/ml/machine-learning-databases/flags

### Try Again

In [4]:
# Keep on trying things until you get it. 
# If you really mess things up you can always just restart your runtime
column_headers = ['name', 'landmass', 'zone', 'area', 'population', 'language', 'religion', 'bars', 'stripes', 'colors', 'red', 'green',
                  'blue', 'gold', 'white', 'black', 'orange', 'mainhue', 'circles', 'crosses', 'saltires', 'quarters', 'sunstars', 'crescent',
                  'triangle', 'icon', 'animate', 'text', 'topleft', 'botright']

In [5]:
# try again
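
For a file with no header line, one possible fix is to pass `header=None` together with your own `names`. A minimal sketch on made-up rows, where `toy_headers` is a hypothetical three-column list standing in for the full `column_headers` list:

```python
import io

import pandas as pd

# Hypothetical headerless sample; not real flag data
raw_text = "Atlantis,1,red\nElbonia,2,green\nRuritania,3,blue\n"
toy_headers = ["name", "landmass", "mainhue"]

# header=None: no line in the file is a header; names: labels to use instead
df_fixed = pd.read_csv(io.StringIO(raw_text), header=None, names=toy_headers)

print(df_fixed.shape)
print(df_fixed.head())
```

With the real dataset, the same two parameters would be passed along with the URL and the full header list.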


### Load a dataset (CSV) from a local file

In [6]:
# now load it from the "data" folder in our repo

In [7]:
# save your output as a new file
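
Saving and reloading could look something like the sketch below. The file name `flags_clean.csv` and the temporary folder are assumptions for illustration; in the notebook you would point at the "data" folder in the repo instead.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for a cleaned dataframe
df_out = pd.DataFrame({"name": ["Atlantis", "Elbonia"], "landmass": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "flags_clean.csv")  # hypothetical name

    # index=False keeps the row index from being written as an extra column
    df_out.to_csv(out_path, index=False)

    # This file has a header row, so the defaults work when reading it back
    df_back = pd.read_csv(out_path)

print(df_back)
```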

### Use basic Pandas functions for Exploratory Data Analysis (EDA)

## Overview

> Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations


## Follow Along

What can we discover about this dataset?

- df.shape
- df.head()
- df.dtypes
- df.describe()
 - Numeric
 - Non-Numeric
- df['column'].value_counts()
- df.isnull().sum()
- df.fillna()
- df.dropna()
- df.drop()
- pd.crosstab()
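
Here is a quick sketch of the first several functions in that list, run on a tiny made-up dataframe (the rows are hypothetical; only the column names echo the Adult dataset).

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe with one numeric and one text column
toy = pd.DataFrame({
    "age": [25, 38, 52, 38],
    "workclass": ["Private", "Private", None, "State-gov"],
})

print(toy.shape)                             # (rows, columns)
print(toy.head())                            # first rows
print(toy.dtypes)                            # type of each column
print(toy.describe())                        # numeric columns by default
print(toy.describe(exclude=[np.number]))     # non-numeric columns
print(toy["workclass"].value_counts())       # frequency of each value
print(toy.isnull().sum())                    # missing values per column
```

Note that `value_counts()` skips missing values unless you pass `dropna=False`.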

Let's try reading in a new dataset: the Adult Dataset
https://archive.ics.uci.edu/ml/datasets/adult

In [8]:
# a new dataset

In [9]:
column_headers2=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship',
                 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

In [10]:
# try reading it in

In [11]:
#shape

In [12]:
# what columns?

In [13]:
# data types

In [14]:
# summary stats

In [15]:
# compare to the whole dataframe

In [16]:
# filter to specific columns

In [17]:
# filter to multiple columns

In [18]:
# filter specific rows
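
Column and row filtering might be sketched like this, on made-up rows (hypothetical values, with column names borrowed from `column_headers2`):

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "age": [25, 38, 52, 17],
    "education": ["Bachelors", "HS-grad", "Masters", "11th"],
    "hours-per-week": [40, 50, 60, 20],
})

# One column: single brackets give a Series
one_column = toy["age"]

# Multiple columns: a list inside the brackets gives a DataFrame
two_columns = toy[["age", "education"]]

# Rows: a boolean condition keeps only the rows where it is True
over_30 = toy[toy["age"] > 30]

# Combine conditions with & (and) or | (or), each wrapped in parentheses
over_30_long_hours = toy[(toy["age"] > 30) & (toy["hours-per-week"] > 50)]

print(over_30_long_hours)
```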

In [19]:
# frequencies of a column

In [20]:
# what about missing values?

In [21]:
# check for any missing data!

In [22]:
# recode that question mark as missing!
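
Two possible ways to recode the question mark are sketched below on made-up rows. One assumption to check against the raw file: the values are separated by a comma plus a space, so the placeholder is read in as `" ?"` with a leading space rather than `"?"`.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows in a comma-plus-space layout; "?" marks a missing value
raw_text = "39, State-gov, Bachelors\n50, ?, HS-grad\n38, Private, ?\n"
toy_headers = ["age", "workclass", "education"]

# Read as-is: pandas does not treat "?" as missing by default
df_raw = pd.read_csv(io.StringIO(raw_text), header=None, names=toy_headers)

# Option 1: replace the placeholder (leading space included) after loading
df_recoded = df_raw.replace(" ?", np.nan)

# Option 2: strip the leading spaces and flag "?" as missing while reading
df_na = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    names=toy_headers,
    skipinitialspace=True,
    na_values="?",
)

print(df_raw.isnull().sum())
print(df_na.isnull().sum())
```

Option 2 also cleans the leading spaces off every other text value, which makes later filtering less error-prone.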

In [23]:
# check for any missing data!

In [24]:
# drop all missing data!

In [25]:
# check for any missing data!

In [26]:
# create some missing data!

In [27]:
# nothing here

In [28]:
# check for any missing data!

In [29]:
# replace missing values!
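
Dropping versus replacing missing values can be compared side by side. In this sketch the rows are made up, and the fill values (the label `"Unknown"` and the column median) are just one possible choice, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value in each column
toy = pd.DataFrame({
    "workclass": ["Private", np.nan, "State-gov"],
    "hours-per-week": [40.0, 50.0, np.nan],
})

# dropna() removes every row that has at least one missing value
dropped = toy.dropna()

# fillna() with a dict fills each column with its own replacement
filled = toy.fillna({
    "workclass": "Unknown",
    "hours-per-week": toy["hours-per-week"].median(),
})

print(len(toy), len(dropped))
print(filled)
```

Both calls return a new dataframe and leave `toy` unchanged, so assign the result if you want to keep it.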

In [30]:
# value freqs

In [31]:
# how to build a crosstab!

In [32]:
# another example

In [33]:
# what if you want pcts?

In [34]:
# what if you want margins?
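
The crosstab variations above (counts, percentages, margins) could be sketched as follows on five made-up rows:

```python
import pandas as pd

# Made-up rows; column names borrowed from the Adult dataset headers
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Raw counts: rows are the first argument, columns the second
counts = pd.crosstab(toy["sex"], toy["income"])

# Percentages: normalize="index" makes each row sum to 1
row_pcts = pd.crosstab(toy["sex"], toy["income"], normalize="index")

# Margins: adds an "All" row and column with the totals
with_totals = pd.crosstab(toy["sex"], toy["income"], margins=True)

print(counts)
print(row_pcts)
print(with_totals)
```

`normalize` also accepts `"columns"` (each column sums to 1) and `"all"` (the whole table sums to 1).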

### Generate Basic Visualizations (graphs) with Pandas

## Overview

One of the cornerstones of Exploratory Data Analysis (EDA) is visualizing our data to understand how each variable is distributed and how the variables relate to one another. Our brains are amazing pattern detection machines and sometimes the "eyeball test" is the most efficient one. In this section we'll look at some of the most basic kinds of "exploratory visualizations" to help us better understand our data.

## Follow Along

Let's demonstrate creating a:

- Line Plot
- Histogram
- Scatter Plot
- Density Plot
- Making plots of our crosstabs

How does each of these plots show us something different about the data? 

Why might it be important for us to be able to visualize how our data is distributed?

### Line Plot

In [35]:
# freqs

In [36]:
# show that in a plot

### Histogram

In [37]:
# default has 10 bins

In [38]:
# add more bins
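
To see what the `bins` parameter changes, here is a sketch that draws the default histogram next to one with more bins. The ages are made up, and the `"Agg"` backend line is only there so the example runs outside a notebook (it assumes matplotlib is installed).

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages for illustration
ages = pd.Series([17, 22, 25, 31, 38, 38, 44, 52, 60, 75], name="age")

# Two side-by-side axes so the plots do not draw on top of each other
fig, (ax_default, ax_more) = plt.subplots(1, 2, figsize=(10, 4))

ages.plot.hist(ax=ax_default, title="default bins")   # 10 bins
ages.plot.hist(bins=20, ax=ax_more, title="bins=20")  # narrower bins
```

More bins show finer detail but can make a small sample look noisy, so it is worth trying a few values.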

### Scatter Plot


In [39]:
# take a sample

In [40]:
# combine these numeric variables in a scatter plot!

In [41]:
# let's get rid of outliers!

In [42]:
# combine these numeric variables in a scatter plot!

### Density Plot - Kernel Density Estimate (KDE)

In [43]:
# similar to histogram

### Plotting using Crosstabs

In [44]:
# crosstabs code

In [45]:
# show using a bar chart
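
Plotting a crosstab as a bar chart might look like the sketch below. The rows are made up, and the `"Agg"` backend line is only there so the example runs outside a notebook (it assumes matplotlib is installed).

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; column names borrowed from the Adult dataset headers
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

ct = pd.crosstab(toy["sex"], toy["income"])

# Each crosstab row becomes a group of bars, each column a colored series
fig, ax = plt.subplots()
ct.plot(kind="bar", ax=ax)
ax.set_ylabel("count")
```

Passing `stacked=True` to `plot` is one option if you would rather compare totals than individual groups.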

## Challenge

These are some of the most basic and important types of data visualizations. They're so important that they're built straight into Pandas and can be accessed with some very concise code. At the beginning, our data exploration is about understanding the characteristics of our dataset, but over time it becomes about communicating insights in as effective and digestible a manner as possible, and that typically means using graphs in one way or another. See how intuitive a graph you can make using a crosstab on this dataset.