# Data Pre-processing Libraries for 2018

By Gabe Wilberscheid

If you got into data science and machine learning because you thought you would spend most of your day working with AI and playing with friendly robots, I have bad news for you. That day is not here yet. Most of your time is actually spent collecting good data, then "cleaning" that data so you can feed your models good features. In this notebook I will walk you through the steps required to get your data ready for machine learning.

### Steps in Data Cleaning
1.  Collect your data
2.  Check that data for missing values (NaN)
3.  Decide whether to drop the missing data or whether you can accurately fill it in
4.  Work with categorical data and create numerical representations
5.  Understand what insight you can extract from your data
6.  Try to create new features that may generate insight for your model
7.  Normalize numerical features and encode categorical ones (for example, as one-hot dummy variables)

We will go into more detail once we get to each step.
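
As a rough preview of steps 2, 3, 4 and 7, here is a minimal sketch on a tiny made-up table. The column names (`age`, `city`) and the choice to fill with the median are assumptions for illustration only; they are not part of the iris data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one categorical column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
})

# Step 2: count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Step 3: either drop incomplete rows or fill the gap (here: with the median)
dropped = toy.dropna()
filled = toy.fillna({"age": toy["age"].median()})

# Steps 4 and 7: one-hot encode the categorical column into dummy variables
encoded = pd.get_dummies(filled, columns=["city"])
print(encoded)
```

Whether dropping or filling is the better call depends on how much data you can afford to lose and how trustworthy the fill value is.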

## Libraries
Using Python for machine learning mainly requires these three libraries, which are industry standard.
### Pandas
https://pandas.pydata.org/
Pandas is used for importing files, locating your data, checking for NaNs, and much more. Its central data structure is the DataFrame, a table-like pandas object with many useful methods.
### Numpy
http://www.numpy.org/
NumPy's main use is its n-dimensional array object (`ndarray`), which comes with several useful methods. We will often put these arrays into a pandas DataFrame, and later extract single-feature arrays from it to feed models.
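
A small sketch of that round trip, from array to DataFrame and back to a single-feature array. The values and the column names `x` and `y` are made up for this example.

```python
import numpy as np
import pandas as pd

# A small 2-D array: 3 rows (samples) by 2 columns (features)
arr = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
print(arr.shape)

# Wrap the array in a DataFrame; the column names are made up for this example
frame = pd.DataFrame(arr, columns=["x", "y"])

# Pull a single feature back out as a 1-D NumPy array
x_values = frame["x"].to_numpy()
print(type(x_values), x_values)
```

`to_numpy()` needs pandas 0.24 or later; on older versions the `.values` attribute does the same job.
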
### Scikit-Learn
http://scikit-learn.org/stable/
Scikit-learn is a package that implements machine learning algorithms. It also has several great tools for pre-processing data, which is what we will look at here.
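
To give a flavor of those pre-processing tools, here is a minimal sketch using `StandardScaler` from `sklearn.preprocessing`. The input array is made up, and standardization is just one of several scaling options you could pick.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# fit_transform learns each column's mean and standard deviation, then rescales
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

After scaling, both features sit on the same footing, which matters for models that are sensitive to feature magnitude.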

## Demo
There is no better way to learn something than by doing it. And there is no better way to check your understanding of a subject than by teaching it.

In [2]:
# These import aliases are the ones commonly used in practice (np and pd especially), I suggest you use them
import numpy as np
import pandas as pd
import sklearn as skl

In [3]:
# We will need some data to work with to show the features of these libraries
# Luckily scikit-learn ships with some pre-loaded datasets, we just need to import them
from sklearn import datasets

iris_data = datasets.load_iris()
print(type(iris_data))
# this does not load as a DataFrame or np.array, which are what we like to work with

<class 'sklearn.utils.Bunch'>
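
The `Bunch` itself is a dictionary-like container, so it helps to peek inside before converting anything. A short sketch of what you can inspect:

```python
from sklearn import datasets

iris_data = datasets.load_iris()

# A Bunch behaves like a dict whose keys are also available as attributes
print(list(iris_data.keys()))

# The features and labels inside it are already NumPy arrays
print(type(iris_data.data), iris_data.data.shape)
print(iris_data.target_names)
```

The `data` and `feature_names` entries are the two pieces needed to build a DataFrame.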


In [4]:
df = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
print(type(df))
# now this is a DataFrame
df.head()

<class 'pandas.core.frame.DataFrame'>


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


#### Pandas DataFrames
We can get a lot of info out of a pandas DataFrame.

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB


In [20]:
df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


In [8]:
# pandas also allows for the creation of new DataFrames or Series
s = pd.Series([1,3,4,np.nan, 7,8])
s

0    1.0
1    3.0
2    4.0
3    NaN
4    7.0
5    8.0
dtype: float64
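
Since this Series contains a NaN, it is a handy place to try the "check, then drop or fill" steps from the list at the top. A minimal sketch; filling with the mean is only one possible choice and is an assumption here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 4, np.nan, 7, 8])

# Boolean mask of missing entries, and how many there are
print(s.isna())
n_missing = s.isna().sum()

# Option 1: drop the missing entry (the index keeps its gap at 3)
s_dropped = s.dropna()

# Option 2: fill it with the mean of the remaining values
s_filled = s.fillna(s.mean())
print(s_filled)
```

Note that `mean()` skips NaN by default, so the fill value is computed from the five known numbers only.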

In [11]:
# we can also generate a range of dates, here six consecutive days
date = pd.date_range('20180920', periods=6)
date

DatetimeIndex(['2018-09-20', '2018-09-21', '2018-09-22', '2018-09-23',
               '2018-09-24', '2018-09-25'],
              dtype='datetime64[ns]', freq='D')

In [15]:
# and of course combine different arrays and Series together; the values here are random
dummy_df = pd.DataFrame(np.random.randn(6,4), index=date, columns=('A','B','C','D'))
dummy_df

| | A | B | C | D |
|---|---|---|---|---|
| 2018-09-20 | -0.45516 | 1.000004 | 0.333303 | -2.477724 |
| 2018-09-21 | 1.53546 | 0.253518 | 0.713401 | -1.797877 |
| 2018-09-22 | 2.003688 | 1.217865 | 1.080574 | -0.914407 |
| 2018-09-23 | -0.79626 | 0.39982 | -0.075177 | 0.506811 |
| 2018-09-24 | -1.379546 | 1.085149 | 0.434317 | -0.203744 |
| 2018-09-25 | 2.200222 | 0.263414 | 0.510222 | -1.090588 |


In [17]:
# we can also pass a dict into a DataFrame; each value becomes a column (scalars are repeated for every row)
df2 = pd.DataFrame({ 'A' : 1.,
    'B' : pd.Timestamp('20130102'),
    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
    'D' : np.array([3] * 4,dtype='int32'),
    'E' : pd.Categorical(["test","train","test","train"]),
    'F' : 'foo' })
df2

| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
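
Each column in a DataFrame built this way keeps its own dtype, and the categorical column is worth a closer look because it is the starting point for numerical encoding. A small sketch that rebuilds a reduced version of the frame (columns `B` and `F` are left out here just to keep it short):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "A": 1.0,
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

# Each column keeps its own dtype
print(df2.dtypes)

# A Categorical column stores integer codes plus a list of categories
print(df2["E"].cat.categories)
print(df2["E"].cat.codes)
```

The integer codes are already a simple numerical representation of the category labels.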


In [36]:
# back to our DataFrame, we can count the number of times each value appears
df['sepal length (cm)'].value_counts()

5.0    10
6.3     9
5.1     9
6.7     8
5.7     8
5.5     7
5.8     7
6.4     7
6.0     6
4.9     6
6.1     6
5.4     6
5.6     6
6.5     5
4.8     5
7.7     4
6.9     4
5.2     4
6.2     4
4.6     4
7.2     3
6.8     3
4.4     3
5.9     3
6.6     2
4.7     2
7.6     1
7.4     1
4.3     1
7.9     1
7.3     1
7.0     1
4.5     1
5.3     1
7.1     1
Name: sepal length (cm), dtype: int64
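
With a continuous feature the raw counts get long, so two optional arguments of `value_counts` can help: `normalize` for fractions and `bins` for grouping into ranges. A sketch on a short made-up Series (the values and the choice of 2 bins are assumptions):

```python
import pandas as pd

lengths = pd.Series([5.0, 5.1, 5.0, 6.3, 5.0, 6.3])

# Raw counts, most frequent value first
counts = lengths.value_counts()
print(counts)

# Fractions instead of counts
fractions = lengths.value_counts(normalize=True)
print(fractions)

# Group continuous values into 2 equal-width bins before counting
binned = lengths.value_counts(bins=2)
print(binned)
```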

In [32]:
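# We can perform some simple statistics on our data
# Series: no axis argument needed
# DataFrame: "index" (axis=0, default), "columns" (axis=1)
# Note: 'sepal ratio' shows up in the output because this cell was run after that column was created in In [30]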
df.mean()

sepal length (cm)    5.843333
sepal width (cm)     3.054000
petal length (cm)    3.758667
petal width (cm)     1.198667
sepal ratio          0.534428
dtype: float64
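
To make the axis argument concrete, here is a minimal sketch on a made-up 3x2 DataFrame, comparing the default column-wise mean with the row-wise one.

```python
import pandas as pd

small = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# axis=0 (default): collapse the rows, giving one mean per column
col_means = small.mean()
print(col_means)

# axis=1: collapse the columns, giving one mean per row
row_means = small.mean(axis=1)
print(row_means)
```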

In [24]:
# we can also del or pop columns off our df, or select individual cols
dfp = df.copy() # we make a copy so we do not affect our df
del dfp['sepal length (cm)']

In [26]:
dfp.head()

| | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|
| 0 | 3.5 | 1.4 | 0.2 |
| 1 | 3.0 | 1.4 | 0.2 |
| 2 | 3.2 | 1.3 | 0.2 |
| 3 | 3.1 | 1.5 | 0.2 |
| 4 | 3.6 | 1.4 | 0.2 |
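
The cell above only shows `del`. Here is a short sketch of the other options mentioned in the comment, `pop` and column selection, plus `drop` as a non-destructive alternative. The frame and its column names are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# pop removes the column in place and returns it as a Series
b_column = frame.pop("b")

# drop returns a new DataFrame and leaves the original untouched
without_c = frame.drop(columns=["c"])

# Selecting one column gives a Series, a list of columns gives a DataFrame
a_series = frame["a"]
a_frame = frame[["a"]]

print(frame.columns.tolist(), without_c.columns.tolist())
```

`pop` is convenient when you want to separate a label column from the features in one step.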


In [30]:
# we can be clever and create new feature columns out of old ones
# and with one line add them to our pandas DataFrame
df['sepal ratio'] = df['sepal width (cm)'] / df['sepal length (cm)']
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | sepal ratio |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.686275 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0.612245 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0.680851 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0.673913 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0.72 |


In [44]:
# we can also easily sort by values or index; head(10) just shows fewer rows
df.sort_values(by='sepal ratio', ascending=False).head(10)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | sepal ratio |
|---|---|---|---|---|---|
| 32 | 5.2 | 4.1 | 1.5 | 0.1 | 0.788462 |
| 22 | 4.6 | 3.6 | 1.0 | 0.2 | 0.782609 |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 | 0.77193 |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | 0.763636 |
| 46 | 5.1 | 3.8 | 1.6 | 0.2 | 0.745098 |
| 44 | 5.1 | 3.8 | 1.9 | 0.4 | 0.745098 |
| 19 | 5.1 | 3.8 | 1.5 | 0.3 | 0.745098 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0.73913 |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 | 0.727273 |
| 21 | 5.1 | 3.7 | 1.5 | 0.4 | 0.72549 |
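
Sorting by index works the same way, and `nlargest` is a shortcut for "sort descending, then take the top rows". A minimal sketch on a made-up frame with a shuffled index:

```python
import pandas as pd

frame = pd.DataFrame({"ratio": [0.7, 0.5, 0.9]}, index=[2, 0, 1])

# Sort by a column's values, largest first
by_value = frame.sort_values(by="ratio", ascending=False)

# Sort by the index labels to restore the original row order
by_index = frame.sort_index()

# Shortcut for the top-n rows by a column
top_two = frame.nlargest(2, "ratio")

print(by_value.index.tolist(), by_index.index.tolist(), top_two.index.tolist())
```

None of these calls modify `frame`; each returns a new sorted DataFrame.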
