## Pandas | Data Parsing and Mining

#### Start out by importing the pandas library, and we'll tell Python that we're referring to it as "pd" (for the sake of brevity)

In [2]:
import pandas as pd

#### Read a CSV file into a pandas dataframe we'll call "df"
#### The CSV reading function assumes that each person is on a new line, that the variable names are on the first line of the file, and that all entries are separated by commas

In [35]:
df = pd.read_csv('../dataset/titanic.csv')
print(df)

     survived  pclass                                               name  \
0           0       3                            Braund, Mr. Owen Harris   
1           1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2           1       3                             Heikkinen, Miss. Laina   
3           1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4           0       3                           Allen, Mr. William Henry   
5           0       3                                   Moran, Mr. James   
6           0       1                            McCarthy, Mr. Timothy J   
7           0       3                     Palsson, Master. Gosta Leonard   
8           1       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   
9           1       2                Nasser, Mrs. Nicholas (Adele Achem)   
10          1       3                    Sandstrom, Miss. Marguerite Rut   
11          1       1                           Bonnell, Miss. Elizabeth   
12          
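The assumptions listed above (comma separator, names on the first line) are only defaults and can be spelled out or overridden. A minimal sketch, assuming a small made-up semicolon-separated sample held in memory (the names and ages below are hypothetical and not taken from the Titanic file):

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for a file on disk.
csv_text = "name;age\nAnn;31\nBob;45\nCy;27\n"

# sep=";" overrides the default comma separator;
# header=0 says the first line holds the column names.
sample = pd.read_csv(io.StringIO(csv_text), sep=";", header=0)
print(sample)
```

`io.StringIO` is used here only so the example runs without a file; a path string works the same way.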

#### You can view the variable/column names

In [15]:
list(df.columns.values)

['survived',
 'pclass',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked',
 'age_recode']

#### You can view how many data entries/cases there are

In [6]:
print(len(df.index))

891
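If you also want the number of columns, the `shape` attribute returns both counts at once. A small sketch on a made-up dataframe (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical dataframe with 3 rows and 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# shape is a (rows, columns) tuple; len() gives the row count only.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
print(len(toy))
```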


#### You can access a specific column/variable using brackets and the column name in quotes, or using dot notation

In [7]:
print(df['sex'])

# or, when the column name is a valid Python identifier
print(df.sex)

0        male
1      female
2      female
3      female
4        male
5        male
6        male
7        male
8      female
9      female
10     female
11     female
12       male
13       male
14     female
15     female
16       male
17       male
18     female
19     female
20       male
21       male
22     female
23       male
24     female
25     female
26       male
27       male
28     female
29       male
        ...  
861      male
862    female
863    female
864      male
865    female
866    female
867      male
868      male
869      male
870      male
871    female
872      male
873      male
874    female
875    female
876      male
877      male
878      male
879    female
880    female
881      male
882    female
883      male
884      male
885    female
886      male
887    female
888    female
889      male
890      male
Name: sex, dtype: object
0        male
1      female
2      female
3      female
4        male
5        male
6        male
7        male
8      fe

#### You can also access multiple columns at once

In [8]:

print(df[["sex","fare"]])

        sex      fare
0      male    7.2500
1    female   71.2833
2    female    7.9250
3    female   53.1000
4      male    8.0500
5      male    8.4583
6      male   51.8625
7      male   21.0750
8    female   11.1333
9    female   30.0708
10   female   16.7000
11   female   26.5500
12     male    8.0500
13     male   31.2750
14   female    7.8542
15   female   16.0000
16     male   29.1250
17     male   13.0000
18   female   18.0000
19   female    7.2250
20     male   26.0000
21     male   13.0000
22   female    8.0292
23     male   35.5000
24   female   21.0750
25   female   31.3875
26     male    7.2250
27     male  263.0000
28   female    7.8792
29     male    7.8958
..      ...       ...
861    male   11.5000
862  female   25.9292
863  female   69.5500
864    male   13.0000
865  female   13.0000
866  female   13.8583
867    male   50.4958
868    male    9.5000
869    male   11.1333
870    male    7.8958
871  female   52.5542
872    male    5.0000
873    male    9.0000
874  femal

#### You can view either the first N or the last N rows

In [10]:
N = 2

print(df.head(N))

   survived  pclass                                               name  \
0         0       3                            Braund, Mr. Owen Harris   
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...   

      sex   age  sibsp  parch     ticket     fare cabin embarked  
0    male  22.0      1      0  A/5 21171   7.2500   NaN        S  
1  female  38.0      1      0   PC 17599  71.2833   C85        C  


In [None]:
print(df.tail(N))

#### You can recode a variable by mapping old values to new values

In [11]:
df['pclass'] = df["pclass"].astype("str")
# replace() returns a new Series; assign the result back to df["pclass"] to keep the recoding
df["pclass"].replace({'1': 'upper', '2' : 'middle', '3' : "lower"})

0       lower
1       upper
2       lower
3       upper
4       lower
5       lower
6       upper
7       lower
8       lower
9      middle
10      lower
11      upper
12      lower
13      lower
14      lower
15     middle
16      lower
17     middle
18      lower
19      lower
20     middle
21     middle
22      lower
23      upper
24      lower
25      lower
26      lower
27      upper
28      lower
29      lower
        ...  
861    middle
862     upper
863     lower
864    middle
865    middle
866    middle
867     upper
868     lower
869     lower
870     lower
871     upper
872     upper
873     lower
874    middle
875     lower
876     lower
877     lower
878     lower
879     upper
880    middle
881     lower
882     lower
883    middle
884     lower
885     lower
886    middle
887     upper
888     lower
889     upper
890     lower
Name: pclass, dtype: object
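Two details of recoding are easy to miss: values that are not in the mapping, and the difference between `replace` and `map`. A minimal sketch on a made-up Series (the value "4" is a hypothetical unexpected code, not something in the Titanic data):

```python
import pandas as pd

# Hypothetical class codes, including one value ("4") missing from the mapping.
pclass = pd.Series(["1", "3", "2", "4"])
labels = {"1": "upper", "2": "middle", "3": "lower"}

# replace() leaves unmatched values untouched.
replaced = pclass.replace(labels)

# map() turns unmatched values into NaN, which makes them easy to find.
mapped = pclass.map(labels)

print(replaced.tolist())
print(mapped.isnull().sum())
```

Which behavior is preferable depends on whether unexpected codes should pass through or be flagged as missing.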

In [17]:
# .ix has been removed from pandas; .loc does the same label-based assignment
df.loc[df.age >= 18, 'age_recode'] = "adult"
df.loc[df.age < 18, 'age_recode'] = "child"
df.columns.values

array(['survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch',
       'ticket', 'fare', 'cabin', 'embarked', 'age_recode'], dtype=object)

#### You can create new variables

In [18]:
df["familymembers"] = df["sibsp"] + df["parch"] 
print(df[["sibsp","parch","familymembers"]])

     sibsp  parch  familymembers
0        1      0              1
1        1      0              1
2        0      0              0
3        1      0              1
4        0      0              0
5        0      0              0
6        0      0              0
7        3      1              4
8        0      2              2
9        1      0              1
10       1      1              2
11       0      0              0
12       0      0              0
13       1      5              6
14       0      0              0
15       0      0              0
16       4      1              5
17       0      0              0
18       1      0              1
19       0      0              0
20       0      0              0
21       0      0              0
22       0      0              0
23       0      0              0
24       3      1              4
25       1      5              6
26       0      0              0
27       3      2              5
28       0      0              0
29       0

#### You can transform a variable by applying a function to every value (here, squaring it)

In [22]:
df['familymembers'].apply(lambda x: x**2)

0        1
1        1
2        0
3        1
4        0
5        0
6        0
7       16
8        4
9        1
10       4
11       0
12       0
13      36
14       0
15       0
16      25
17       0
18       1
19       0
20       0
21       0
22       0
23       0
24      16
25      36
26       0
27      25
28       0
29       0
      ... 
861      1
862      0
863    100
864      0
865      0
866      1
867      0
868      0
869      4
870      0
871      4
872      0
873      0
874      1
875      0
876      0
877      0
878      0
879      1
880      1
881      0
882      0
883      0
884      0
885     25
886      0
887      0
888      9
889      0
890      0
Name: familymembers, dtype: int64
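To code a numeric variable into labelled ranges, one option is `pd.cut`. A minimal sketch, assuming made-up ages and cut points (the bin edges 18 and 65 and the labels are illustrative choices, not part of the original lesson):

```python
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([4, 17, 18, 40, 71, None])

# right=False makes each bin include its left edge: [0, 18), [18, 65), [65, 120).
age_group = pd.cut(
    ages,
    bins=[0, 18, 65, 120],
    right=False,
    labels=["child", "adult", "senior"],
)
print(age_group)
```

Missing ages stay missing in the result rather than being assigned to a range.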

#### You can combine two different data sources that share a common attribute (e.g., "name")

In [26]:
survivors = df[["name","survived"]]
ticketprices = df[["name","fare"]]
merged = survivors.merge(ticketprices, on="name")
print(merged)
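When the two sources do not contain exactly the same keys, the `how` and `indicator` arguments of `merge` show what happened to each row. A small sketch on made-up tables (all names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical sources that share a "name" column but not all of its values.
toy_survivors = pd.DataFrame({"name": ["ann", "bob", "cy"], "survived": [1, 0, 1]})
toy_fares = pd.DataFrame({"name": ["ann", "bob", "dee"], "fare": [7.25, 71.28, 8.05]})

# Default (inner) merge keeps only names present in both tables.
inner = toy_survivors.merge(toy_fares, on="name")

# A left merge keeps every row of the left table; indicator=True adds a
# "_merge" column saying where each row was found.
left = toy_survivors.merge(toy_fares, on="name", how="left", indicator=True)

print(inner)
print(left)
```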




## Examine Categorical Variables for Irregularities

#### Examine the different values entered

In [27]:
pd.unique(df["sex"])

array(['male', 'female'], dtype=object)
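Counting how often each value occurs can make irregular entries stand out more than a list of unique values does. A minimal sketch, assuming a made-up messy column (the Titanic `sex` column above is already clean, so these values are hypothetical):

```python
import pandas as pd

# Hypothetical messy entries: mixed case, stray spaces, and a missing value.
raw_sex = pd.Series(["male", "Female", " male", "female", "FEMALE ", None])

# dropna=False includes missing values in the tally.
counts = raw_sex.value_counts(dropna=False)
print(counts)
```

Values that appear only once or twice in a large column are often typos or formatting variants worth a closer look.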

#### Make all string values lowercase

In [28]:
df['name'] = df['name'].str.lower()
print(df['name'])

0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
5                                       moran, mr. james
6                                mccarthy, mr. timothy j
7                         palsson, master. gosta leonard
8      johnson, mrs. oscar w (elisabeth vilhelmina berg)
9                    nasser, mrs. nicholas (adele achem)
10                       sandstrom, miss. marguerite rut
11                              bonnell, miss. elizabeth
12                        saundercock, mr. william henry
13                           andersson, mr. anders johan
14                  vestrom, miss. hulda amanda adolfina
15                      hewlett, mrs. (mary d kingcome) 
16                                  rice, master. eugene
17                          wil

#### Fix string values that might have an extra hidden space before or after

In [29]:
df['name'] = df['name'].str.strip()
print(df['name'])

0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                 heikkinen, miss. laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
5                                       moran, mr. james
6                                mccarthy, mr. timothy j
7                         palsson, master. gosta leonard
8      johnson, mrs. oscar w (elisabeth vilhelmina berg)
9                    nasser, mrs. nicholas (adele achem)
10                       sandstrom, miss. marguerite rut
11                              bonnell, miss. elizabeth
12                        saundercock, mr. william henry
13                           andersson, mr. anders johan
14                  vestrom, miss. hulda amanda adolfina
15                       hewlett, mrs. (mary d kingcome)
16                                  rice, master. eugene
17                          wil

#### Fix string values that might be misspelled

In [None]:
import difflib
df['sex'] = df['sex'].apply(lambda x: difflib.get_close_matches(x,["male","female"])[0])
print(df['sex'])
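Note that `difflib.get_close_matches` returns an empty list when nothing is similar enough, so indexing with `[0]` can raise an `IndexError`. A minimal sketch of a safer wrapper (the helper name `closest_label` and the choice to fall back to the original value are assumptions for illustration):

```python
import difflib

VALID = ["male", "female"]

def closest_label(value, choices=VALID, cutoff=0.6):
    """Return the best match from choices, or the original value if none is close enough."""
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else value

print(closest_label("femal"))
print(closest_label("mael"))
print(closest_label("unknown"))
```

It could be applied with `df['sex'].apply(closest_label)`; values left unchanged can then be inspected by hand.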

## Examine quantitative variables for irregularities

#### View the central tendency, variability, minimum, maximum, and quartiles to check for impossible or unexpected values

In [30]:
df["fare"].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: fare, dtype: float64
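The quartiles can also be used to flag unusually large or small values. A minimal sketch using the common 1.5 × IQR rule of thumb, which is an assumption added here rather than part of the original lesson, on a handful of made-up fares:

```python
import pandas as pd

# Hypothetical fares with one extreme value.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 512.33])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged for review.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = fares[(fares < lower_fence) | (fares > upper_fence)]
print(outliers)
```

Flagged values are candidates for inspection, not automatically errors.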

#### View a histogram that shows the distribution of the variable

In [32]:
import matplotlib.pyplot as plt
plt.figure();
df["fare"].plot.hist(alpha=0.5)
plt.show()

#### Restrict the values that data can take (upper and lower limits)

In [33]:
df.loc[:,'age'] = df['age'].clip(lower=0, upper=100)
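Clipping pulls out-of-range values to the nearest limit. An alternative, if such values should instead be treated as missing, is `where` combined with `between`. A small sketch on made-up ages (the values are hypothetical):

```python
import pandas as pd

# Hypothetical ages with two impossible values and one missing value.
ages_raw = pd.Series([-3.0, 22.0, 140.0, None])

# clip() replaces values outside the limits with the limits themselves.
ages_clipped = ages_raw.clip(lower=0, upper=100)

# where() keeps values that satisfy the condition and sets the rest to NaN.
ages_as_missing = ages_raw.where(ages_raw.between(0, 100))

print(ages_clipped.tolist())
print(ages_as_missing.tolist())
```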

## Handling Missing Data

#### Checking for missing data

In [34]:
missing_count = df['age'].isnull().sum()
print(missing_count)

missing_data = df[pd.isnull(df['age'])]
print(missing_data)

177
     survived pclass                                            name     sex  \
5           0      3                                moran, mr. james    male   
17          1      2                    williams, mr. charles eugene    male   
19          1      3                         masselmani, mrs. fatima  female   
26          0      3                         emir, mr. farred chehab    male   
28          1      3                   o'dwyer, miss. ellen "nellie"  female   
29          0      3                             todoroff, mr. lalio    male   
31          1      1  spencer, mrs. william augustus (marie eugenie)  female   
32          1      3                        glynn, miss. mary agatha  female   
36          1      3                                mamee, mr. hanna    male   
42          0      3                             kraeff, mr. theodor    male   
45          0      3                        rogers, mr. william john    male   
46          0      3                
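The same check can be run on every column at once, and taking the mean instead of the sum gives the share of missing values. A minimal sketch on a made-up dataframe (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data with gaps in two of three columns.
toy = pd.DataFrame({
    "age": [22.0, None, 38.0, None],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives True/False per cell; sum() counts the Trues per column.
missing_per_column = toy.isnull().sum()

# mean() of True/False values is the proportion missing per column.
missing_share = toy.isnull().mean()

print(missing_per_column)
print(missing_share)
```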

#### Drop rows that have a missing value for age

In [None]:
df = df[pd.notnull(df['age'])]
print(df['age'])

#### Replace missing values with the:
   1) median (middle value) for that column
   
   2) mode (most frequent value) for that column

In [None]:
df['age_medianimpute'] = df['age'].fillna(df['age'].median())
# mode() returns a Series (there can be ties), so take its first value
df['age_modeimpute'] = df['age'].fillna(df['age'].mode()[0])
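One detail worth seeing on a small example: `median()` returns a single number, while `mode()` returns a Series, and passing a Series to `fillna` fills by matching index labels rather than filling every gap. A minimal sketch on made-up ages (all values hypothetical):

```python
import pandas as pd

# Hypothetical ages with two missing values.
ages = pd.Series([30.0, None, 22.0, 22.0, None, 40.0])

# Median of the observed values (22, 22, 30, 40) fills every gap.
median_filled = ages.fillna(ages.median())

# Take the first mode explicitly so a single value fills every gap.
mode_filled = ages.fillna(ages.mode().iloc[0])

# Passing the mode Series itself aligns on index, so the gaps at
# positions 1 and 4 are left unfilled here.
misaligned = ages.fillna(ages.mode())

print(median_filled.tolist())
print(mode_filled.tolist())
print(misaligned.isnull().sum())
```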

#### Replace missing values with the likely value

#### Use a linear regression formula to predict what the missing value would have been

In [None]:
from sklearn import linear_model, preprocessing
import numpy as np


sex_encoder = preprocessing.LabelEncoder().fit(df["sex"])
df['sex_coded'] = sex_encoder.transform(df["sex"])

# Fit only on rows where the predictors and the target are all present
predictors = ["fare", "sex_coded"]
df_complete = df.dropna(subset=predictors + ["age"])
regr = linear_model.LinearRegression()
regr.fit(df_complete[predictors], df_complete['age'])

# Predict each missing age from that row's own fare and sex
missing_age = df['age'].isnull()
if missing_age.any():
    df.loc[missing_age, 'age'] = regr.predict(df.loc[missing_age, predictors])
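To see the mechanics of regression imputation without scikit-learn, here is a minimal sketch that swaps in `numpy.polyfit` for the linear model and uses a single predictor. The data are made up so that age is exactly fare + 10, which makes the filled values easy to check by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical data: age = fare + 10 for the rows where age is known.
toy = pd.DataFrame({
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "age": [20.0, 30.0, None, 50.0, None],
})

known = toy["age"].notnull()

# Fit age = slope * fare + intercept on the complete rows only.
slope, intercept = np.polyfit(toy.loc[known, "fare"], toy.loc[known, "age"], deg=1)

# Predict only for the rows whose age is missing, using each row's own fare.
toy.loc[~known, "age"] = slope * toy.loc[~known, "fare"] + intercept
print(toy)
```

The key points are that the model is fitted on complete rows and that each missing row gets a prediction from its own predictor values.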

#### Use the 2 most similar cases to predict what the missing value would have been

In [None]:
from sklearn.neighbors import KNeighborsRegressor

sex_encoder = preprocessing.LabelEncoder().fit(df["sex"])
df['sex_coded'] = sex_encoder.transform(df["sex"])


pclass_encoder = preprocessing.LabelEncoder().fit(df["pclass"])
df['pclass_coded'] = pclass_encoder.transform(df["pclass"])

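# Fit only on rows where the predictors and the target are all present
predictors = ["fare", "sex_coded", "pclass_coded", "sibsp"]
df_complete = df.dropna(subset=predictors + ["age"])

neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(df_complete[predictors], df_complete["age"])

# Predict each missing age from that row's own predictor values
missing_age = df['age'].isnull()
if missing_age.any():
    df.loc[missing_age, 'age'] = neigh.predict(df.loc[missing_age, predictors])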


## Apply the lesson

#### Import the pandas library (call it pd for short). 

#### Read in the dataset called "redwinequality.csv" in the datasets folder and save it to a dataframe called "df"

#### What are the variables collected on each wine?

#### What is the maximum and minimum pH value?

#### What does the distribution of quality ratings look like?

#### Some of the volatile acidity measurements are missing. How many of the measurements are missing? What proportion of wines are missing a volatile acidity rating?

#### Fill in the missing values for volatile acidity with the median. Create a new column with the filled in missing data. Call that column: va_medianimpute

#### Fill in the missing values for volatile acidity using the 8 most similar cases/neighbors. Base similarity on the variables: fixedacidity, citricacid, residualsugar, and chlorides

#### Create a new column with the filled in missing data. Call that column: va_neighborimpute

#### Load in the complete wine dataset (redwinequality_complete.csv), which has the true values for the missing values. Call the dataframe: df_true

#### Compute the absolute difference between the volatile acidity column of the complete dataset and each of the va_medianimpute and va_neighborimpute variables you created.

#### Which imputation method has the smallest MEAN/AVERAGE absolute difference?

#### You want to recode chlorides as "high" if the chloride level is above .25 and "low" if it is below or equal to .25. Create a new variable called "chlorides_category" that follows those rules. Show the new coded variable next to the old numeric variable.