# Beginner's Guide to Data Science

## Table of Contents

- Finding data
- Missing values
- Dealing with NaN values
- How to deal with categorical data
- Preparing/Cleaning data
- Overviewing data
- Graphing data


## Finding data

Many datasets are freely available online; this [collection of data sources](https://github.com/shinokada/python-for-ib-diploma-mathematics/blob/master/Data_sources.ipynb) is a good starting point. A CSV (comma-separated values) file is one of the easiest formats to work with, and most of the data sources in the list provide one. Pick a couple of datasets that interest you and download them to your hard drive.
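
Once a file is downloaded, loading it takes a single call to `pd.read_csv`. The sketch below is only an illustration: it reads a tiny in-memory CSV with made-up contents (the `name` and `score` columns are hypothetical) so that it runs without any file on disk. With a real download you would pass the file path instead of the `StringIO` object.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a downloaded file.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)  # (number of rows, number of columns)
print(sample.head())  # first rows, useful for a quick sanity check
```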

## Missing values

<center><img src="image/missing.png"></center>

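First, we import a sample dataset in which missing entries are marked in several ways: `?`, `na`, `NA`, `N/A`, or simply left empty. We check whether the dataframe has any null values using `isnull()`. Chaining `any()` reports, column by column, whether at least one null value is present.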

In [73]:
import pandas as pd

df = pd.read_csv('./Data/missing.csv')
df
df.isnull()
df.isnull().any()

Unnamed: 0,col1,col2,col3
0,11.0,20,cat
1,48.0,?,dog
2,,47,dog
3,35.0,na,cat
4,48.0,0,fox
5,35.0,,dog
6,9.0,,cat
7,2.0,,cat


Unnamed: 0,col1,col2,col3
0,False,False,False
1,False,False,False
2,True,False,False
3,False,False,False
4,False,False,False
5,False,True,False
6,False,True,False
7,False,True,False


col1     True
col2     True
col3    False
dtype: bool

Note that pandas puts NaN wherever a data item is missing.
Comparing the first two tables, `?` and `na` are not recognized as null by default. We therefore pass them to `na_values` so that they are read as NaN as well.

In [74]:
df = pd.read_csv('./Data/missing.csv', na_values=['?','na'])
df
df.info()
df.isnull().any()

Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,,dog
2,,47.0,dog
3,35.0,,cat
4,48.0,0.0,fox
5,35.0,,dog
6,9.0,,cat
7,2.0,,cat


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
col1    7 non-null float64
col2    3 non-null float64
col3    8 non-null object
dtypes: float64(2), object(1)
memory usage: 320.0+ bytes


col1     True
col2     True
col3    False
dtype: bool
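
To see how many values are missing rather than just whether any are, `isnull().sum()` gives a per-column count. The following is a minimal sketch, assuming the raw file looks like the in-memory text below (reconstructed from the tables above; the actual `./Data/missing.csv` is not reproduced here). It compares the default parse with one that uses `na_values`.

```python
import io

import pandas as pd

# Assumed raw contents, reconstructed from the tables shown above.
raw = """col1,col2,col3
11,20,cat
48,?,dog
,47,dog
35,na,cat
48,0,fox
35,,dog
9,,cat
2,,cat
"""

default = pd.read_csv(io.StringIO(raw))
cleaned = pd.read_csv(io.StringIO(raw), na_values=["?", "na"])

# Null count per column: '?' and 'na' are counted only after
# they have been listed in na_values.
print(default.isnull().sum())
print(cleaned.isnull().sum())

# With the markers gone, col2 can be parsed as a numeric column.
print(cleaned.dtypes)
```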

## Dealing with NaN values
### Dropping NaN

One way to deal with null values is to drop the rows or columns that contain them. `dropna()` defaults to `axis=0`, so `dropna()` and `dropna(axis=0)` output the same table: every row that contains at least one null value is dropped.

In [113]:
df = pd.read_csv('./Data/missing.csv', na_values=['?','na'])
df
df.dropna()

Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,,dog
2,,47.0,dog
3,35.0,,cat
4,48.0,0.0,fox
5,35.0,,dog
6,9.0,,cat
7,2.0,,cat


Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
4,48.0,0.0,fox


In [76]:
df.dropna(axis=0)

Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
4,48.0,0.0,fox


`axis=1` drops every column that contains a NaN.

In [77]:
df.dropna(axis=1)

Unnamed: 0,col3
0,cat
1,dog
2,dog
3,cat
4,fox
5,dog
6,cat
7,cat
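
Dropping every row or column with a NaN can discard a lot of data (only two of eight rows survive above). `dropna()` also takes `subset`, `thresh`, and `how` parameters for finer control. The sketch below rebuilds the sample table by hand, which is an assumption made so that it runs without the CSV file, and shows each option separately.

```python
import numpy as np
import pandas as pd

# Sample table rebuilt by hand so the example is self-contained.
df = pd.DataFrame({
    "col1": [11, 48, np.nan, 35, 48, 35, 9, 2],
    "col2": [20, np.nan, 47, np.nan, 0, np.nan, np.nan, np.nan],
    "col3": ["cat", "dog", "dog", "cat", "fox", "dog", "cat", "cat"],
})

# Drop a row only when col1 is missing; NaN in col2 is tolerated.
by_subset = df.dropna(subset=["col1"])

# Keep rows with at least 3 non-null values (here: fully populated rows).
by_thresh = df.dropna(thresh=3)

# Drop a row only if every value in it is missing (no such row here).
by_all = df.dropna(how="all")

print(len(by_subset), len(by_thresh), len(by_all))
```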


### Filling up NaN

We can fill NaN values with the column mean. The following fills the NaN values in each column with that column's mean.

In [87]:
df = pd.read_csv('./Data/missing.csv', na_values=['?','na'])
df
df = df.fillna(value={'col1': df['col1'].mean(), 'col2': df['col2'].mean()})
df

Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,,dog
2,,47.0,dog
3,35.0,,cat
4,48.0,0.0,fox
5,35.0,,dog
6,9.0,,cat
7,2.0,,cat


Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,22.333333,dog
2,26.857143,47.0,dog
3,35.0,22.333333,cat
4,48.0,0.0,fox
5,35.0,22.333333,dog
6,9.0,22.333333,cat
7,2.0,22.333333,cat


You can use the median as well.

In [88]:
df = pd.read_csv('./Data/missing.csv', na_values=['?','na'])
df
df = df.fillna(value={'col1': df['col1'].median(), 'col2': df['col2'].median()})
df

Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,,dog
2,,47.0,dog
3,35.0,,cat
4,48.0,0.0,fox
5,35.0,,dog
6,9.0,,cat
7,2.0,,cat


Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,20.0,dog
2,35.0,47.0,dog
3,35.0,20.0,cat
4,48.0,0.0,fox
5,35.0,20.0,dog
6,9.0,20.0,cat
7,2.0,20.0,cat


Or use any value you think is appropriate.

In [89]:
df = pd.read_csv('./Data/missing.csv', na_values=['?','na'])
df
df = df.fillna(value={'col1': 100, 'col2': 200})
df

Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,,dog
2,,47.0,dog
3,35.0,,cat
4,48.0,0.0,fox
5,35.0,,dog
6,9.0,,cat
7,2.0,,cat


Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,200.0,dog
2,100.0,47.0,dog
3,35.0,200.0,cat
4,48.0,0.0,fox
5,35.0,200.0,dog
6,9.0,200.0,cat
7,2.0,200.0,cat
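
When a table has many numeric columns, listing each one in a dictionary gets tedious. One possible shortcut, sketched here on a hand-built copy of the sample table (an assumption so the code runs without the CSV file), is to pass the result of `df.mean(numeric_only=True)` to `fillna()`. This fills each numeric column with its own mean and leaves non-numeric columns such as `col3` untouched.

```python
import numpy as np
import pandas as pd

# Sample table rebuilt by hand so the example is self-contained.
df = pd.DataFrame({
    "col1": [11, 48, np.nan, 35, 48, 35, 9, 2],
    "col2": [20, np.nan, 47, np.nan, 0, np.nan, np.nan, np.nan],
    "col3": ["cat", "dog", "dog", "cat", "fox", "dog", "cat", "cat"],
})

# mean(numeric_only=True) returns one mean per numeric column;
# fillna matches those values to columns by name.
filled = df.fillna(df.mean(numeric_only=True))

print(filled)

# fillna returns a new dataframe; the original still has its NaN values.
print(df.isnull().sum())
```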


## How to deal with categorical data

Our data has three categories in `col3`: cat, dog, and fox. We use `get_dummies()` to convert a categorical variable into dummy/indicator variables. This creates one column per category and fills it with either 0 or 1.

In [111]:
df = pd.read_csv('./Data/missing.csv', na_values=['?','na'])
df
dummies = pd.get_dummies(df['col3'])
dummies

Unnamed: 0,col1,col2,col3
0,11.0,20.0,cat
1,48.0,,dog
2,,47.0,dog
3,35.0,,cat
4,48.0,0.0,fox
5,35.0,,dog
6,9.0,,cat
7,2.0,,cat


Unnamed: 0,cat,dog,fox
0,1,0,0
1,0,1,0
2,0,1,0
3,1,0,0
4,0,0,1
5,0,1,0
6,1,0,0
7,1,0,0


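Next, we join the dummy columns to the original dataframe with `concat()`. `axis=1` appends them as new columns.

In [110]:
df = pd.concat([df, dummies], axis=1)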
df

Unnamed: 0,col1,col2,col3,cat,dog,fox
0,11.0,20.0,cat,1,0,0
1,48.0,,dog,0,1,0
2,,47.0,dog,0,1,0
3,35.0,,cat,1,0,0
4,48.0,0.0,fox,0,0,1
5,35.0,,dog,0,1,0
6,9.0,,cat,1,0,0
7,2.0,,cat,1,0,0


Finally, we drop the original `col3`, which the dummy columns now replace.

In [104]:
df.drop('col3', axis=1, inplace=True)
df

Unnamed: 0,col1,col2,cat,dog,fox
0,11.0,20.0,1,0,0
1,48.0,,0,1,0
2,,47.0,0,1,0
3,35.0,,1,0,0
4,48.0,0.0,0,0,1
5,35.0,,0,1,0
6,9.0,,1,0,0
7,2.0,,1,0,0
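
The three steps above (create dummies, concatenate, drop the original column) can also be done in a single call by passing the whole dataframe to `get_dummies()` with the `columns` parameter. The sketch below uses a small hand-built table rather than the CSV file. The `animal` prefix is just an illustrative choice, and `dtype=int` is passed because newer pandas versions (2.0 and later, to my knowledge) return `True`/`False` instead of 1/0 by default.

```python
import pandas as pd

# Small hand-built table so the example is self-contained.
df = pd.DataFrame({
    "col1": [11.0, 48.0, 35.0],
    "col3": ["cat", "dog", "fox"],
})

# columns= encodes the listed columns and removes the originals.
# prefix= sets the start of each new column name (illustrative choice).
# dtype=int forces 0/1 values instead of booleans.
encoded = pd.get_dummies(df, columns=["col3"], prefix="animal", dtype=int)

print(encoded)
```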


## Preparing/Cleaning data

See the following notebooks for more on preparing data with pandas:

- https://github.com/shinokada/t81_558_deep_learning/blob/master/t81_558_class_02_1_python_pandas.ipynb
- https://github.com/shinokada/t81_558_deep_learning/blob/master/t81_558_class_02_2_pandas_cat.ipynb

## Overviewing data

## Graphing data

/Users/shinokada/DataScience/seaborn-visualizing-statistical-data/02/demos/m1-demo2-VisualizingDistributionsToFindPatterns.ipynb