# Python Basic Data Science Cheatsheet

Pandas is a widely used Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

### Pandas Intro
Reading / loading data:

To read a csv file into a pandas DataFrame, use the read_csv function. The following lines import the pandas library and read a csv file into a DataFrame named df:


In [100]:
# Import the pandas module
import pandas as pd
# Read the csv into a DataFrame df; PassengerId is passed to index_col so that it becomes the row index
df = pd.read_csv('titanic.csv', index_col='PassengerId')
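
To try read_csv without the Titanic file on disk, the sketch below reads a tiny in-memory csv through io.StringIO. The three rows and the name sample are made up for illustration; only the index_col behaviour mirrors the cell above.

```python
import io
import pandas as pd

# Hypothetical three-row csv standing in for titanic.csv
csv_text = """PassengerId,Survived,Fare
1,0,7.25
2,1,71.2833
3,1,7.925
"""

# index_col turns PassengerId into the row index instead of a regular column
sample = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

print(sample.index.name)     # name of the row index
print(list(sample.columns))  # the remaining data columns
```
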

You can preview the df object to see how it looks:

In [101]:
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


To access columns in a DataFrame, index it with a column name, or with a list of names to select several columns at once. Here we see the first ten values of the Ticket and Fare columns.

In [102]:
df[['Ticket', 'Fare']].head(10)

| PassengerId | Ticket | Fare |
|---|---|---|
| 1 | A/5 21171 | 7.25 |
| 2 | PC 17599 | 71.2833 |
| 3 | STON/O2. 3101282 | 7.925 |
| 4 | 113803 | 53.1 |
| 5 | 373450 | 8.05 |
| 6 | 330877 | 8.4583 |
| 7 | 17463 | 51.8625 |
| 8 | 349909 | 21.075 |
| 9 | 347742 | 11.1333 |
| 10 | 237736 | 30.0708 |
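
The difference between passing one column name and passing a list of names can be seen in a small sketch. The two rows and the name df_demo are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration
df_demo = pd.DataFrame(
    {'Ticket': ['A/5 21171', 'PC 17599'], 'Fare': [7.25, 71.2833]},
    index=pd.Index([1, 2], name='PassengerId'),
)

fare_series = df_demo['Fare']        # a single name returns a Series
fare_frame = df_demo[['Fare']]       # a list of names returns a DataFrame
both = df_demo[['Ticket', 'Fare']]   # several columns, in the order listed

print(type(fare_series).__name__)
print(type(fare_frame).__name__)
print(both.shape)
```
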


#### Manipulating data
To produce valid inputs for a model, we will remove some columns from the DataFrame that we do not expect to have predictive value. For the DataFrame above, we can remove the Name, Ticket and Cabin columns with the drop method:

In [103]:
# Remove the Name, Ticket and Cabin columns, as we do not expect them to have predictive power
df = df.drop(['Name','Ticket', 'Cabin'], axis=1)
# Display df again to see the change
df.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |
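
As a small sketch of how drop behaves, the example below uses a made-up DataFrame (df_demo). It shows that the columns keyword is an alternative to axis=1, and that drop returns a new DataFrame rather than changing the original, which is why the cell above assigns the result back to df.

```python
import pandas as pd

# Made-up rows for illustration
df_demo = pd.DataFrame({
    'Name': ['A', 'B'],
    'Ticket': ['t1', 't2'],
    'Fare': [7.25, 8.05],
})

# columns= is equivalent to passing the labels together with axis=1
trimmed = df_demo.drop(columns=['Name', 'Ticket'])
same = df_demo.drop(['Name', 'Ticket'], axis=1)

print(list(trimmed.columns))
print(list(df_demo.columns))  # the original keeps all three columns
```
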


### Handling Missing Data
We can chain the isnull method with sum to count the missing values in each column:

In [104]:
df.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      2
dtype: int64

Common ways to deal with missing values are to drop the corresponding rows entirely (with the dropna method), or to substitute them with a 0 or the column average. Substitution can be done with the fillna method:

In [115]:
# Replacing missing values with a 0
df = df.fillna(0)
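
The other options mentioned above, dropping rows and filling with a column average, might look like the sketch below. The data and the name df_demo are made up, and using the most frequent value for a text column is one possible choice rather than the approach taken in this cheatsheet.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value in each column
df_demo = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Embarked': ['S', None, 'S'],
})

# Option 1: drop every row that has at least one missing value
dropped = df_demo.dropna()

# Option 2: fill each column with its own substitute
filled = df_demo.fillna({
    'Age': df_demo['Age'].mean(),               # column average of the known ages
    'Embarked': df_demo['Embarked'].mode()[0],  # most frequent value
})

print(len(dropped))
print(filled['Age'].tolist())
print(filled['Embarked'].tolist())
```

Passing a dictionary to fillna keeps a numeric substitute out of text columns such as Embarked.
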

### Dealing with categorical data

To deal with columns that contain categorical values, we need to transform them into a numerical format that the model can work with. One way to do this is one-hot encoding, which creates a new binary column for each categorical value. So for the Sex column above, after one-hot encoding we will have a Sex_male column and a Sex_female column, where a 1 in the Sex_male column indicates male in the original Sex column, and so forth.

In our case, we will need to encode the Sex and Embarked columns. Because the missing Embarked values were filled with 0 above, the encoding also produces an Embarked_0 column.

In [116]:
df = pd.get_dummies(df)
df.head()

| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_0 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 0 | 0 | 0 | 1 |
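
To make the encoding step more explicit, the sketch below names the column to encode through the columns parameter of get_dummies. The data and the name df_demo are made up. Recent pandas versions return boolean dummy columns by default, so dtype=int is passed here to get 0/1 values like those in the table above.

```python
import pandas as pd

# Made-up rows for illustration
df_demo = pd.DataFrame({
    'Fare': [7.25, 71.2833, 7.925],
    'Sex': ['male', 'female', 'female'],
})

# Encode only the listed column; other columns pass through unchanged
encoded = pd.get_dummies(df_demo, columns=['Sex'], dtype=int)

print(list(encoded.columns))
print(encoded['Sex_male'].tolist())
```
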


### Data Indexing and Selection 

To select a desired portion of the DataFrame, you can use the loc and iloc indexers. loc selects by row and column labels, whereas iloc selects by integer positions. In both cases the first argument is the desired rows and the second is the desired columns. So to get all of the rows from the 'Fare' column onwards, we can do:

In [117]:
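# Label-based: all rows, columns from 'Fare' onwards
df.loc[:, 'Fare':]
# Position-based equivalent: all rows, columns from position 5 onwards (first five rows shown)
df.iloc[:, 5:].head()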

| PassengerId | Fare | Sex_female | Sex_male | Embarked_0 | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|
| 1 | 7.25 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 71.2833 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 7.925 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 53.1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 5 | 8.05 | 0 | 1 | 0 | 0 | 0 | 1 |
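
One detail that is easy to miss is that label slices with loc include the end point, while position slices with iloc exclude it. The sketch below, on a made-up DataFrame (df_demo), selects the same block both ways.

```python
import pandas as pd

# Made-up rows for illustration
df_demo = pd.DataFrame(
    {'Survived': [0, 1, 1], 'Pclass': [3, 1, 3], 'Fare': [7.25, 71.2833, 7.925]},
    index=pd.Index([1, 2, 3], name='PassengerId'),
)

# loc uses labels, and label slices include the end point
by_label = df_demo.loc[1:2, 'Pclass':'Fare']

# iloc uses integer positions, and position slices exclude the end point
by_position = df_demo.iloc[0:2, 1:3]

print(by_label.equals(by_position))
```
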


### The model

Now that the data is in the correct format, we can begin modelling! A very popular Python machine learning library is scikit-learn. In this example, we will train a RandomForestClassifier to predict survival for our data. We begin by importing the classifier:

In [118]:
from sklearn.ensemble import RandomForestClassifier

Now we prepare the inputs and labels for our model. In this case, the labels are the first column (Survived), and the inputs are the remaining columns.

In [123]:
X_train = df.iloc[:, 1:].values
y_train = df.iloc[:, 0].values

We can now instantiate the RandomForestClassifier and give it the inputs and labels to fit to:

In [124]:
classifier = RandomForestClassifier(n_estimators=20, criterion='entropy')
classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
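
The same fit and predict calls can be tried on a toy data set. In this sketch the arrays X_demo and y_demo are made up, and random_state is an addition that is not used in the cell above; it fixes the random choices made during training so that repeated runs give the same forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data in which the label simply equals the first feature
X_demo = np.array([[0, 1], [0, 2], [1, 1], [1, 2]] * 5)
y_demo = X_demo[:, 0]

# random_state makes the bootstrap samples and feature choices repeatable
clf = RandomForestClassifier(n_estimators=20, criterion='entropy', random_state=0)
clf.fit(X_demo, y_demo)

# Predict on two new rows with the same two features
predictions = clf.predict(np.array([[0, 1], [1, 2]]))
print(predictions.shape)
```
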

Now that we have a trained classifier, we can give it new input to predict on. The new input must have the same columns, in the same order, as the training input.

In [125]:
# X is the new input to predict on, prepared with the same steps as X_train
y_pred = classifier.predict(X)
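
One way to bring new data into the training layout is reindex, sketched below. It assumes the new data has already been through the same cleaning and encoding steps; the column list and the names train_columns and new_data are hypothetical.

```python
import pandas as pd

# Hypothetical encoded training columns, and a new row that lacks one dummy column
train_columns = ['Pclass', 'Fare', 'Sex_female', 'Sex_male']
new_data = pd.DataFrame({'Pclass': [3], 'Fare': [7.25], 'Sex_male': [1]})

# Reorder to the training layout and add any missing column filled with 0
aligned = new_data.reindex(columns=train_columns, fill_value=0)

print(list(aligned.columns))
print(aligned.shape)
```

A missing dummy column can occur when a category seen in training does not appear in the new data.
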

### Submitting the predictions to Kaggle

In [None]:
# Submitting our predictions: submission is the DataFrame read from the submission csv; we assign our predictions to its Survived column
submission['Survived'] = y_pred
# Write the submission DataFrame to a csv
submission.to_csv("submission.csv", index=False)
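
To see what index=False does to the written file, the sketch below writes a small DataFrame to an in-memory buffer instead of a file. The ids and predictions are placeholders and do not come from the real test set.

```python
import io
import pandas as pd

# Placeholder ids and predictions standing in for the real test set
submission = pd.DataFrame({'PassengerId': [892, 893, 894]})
y_pred = [0, 1, 0]

submission['Survived'] = y_pred

# index=False leaves out the 0, 1, 2 row labels, so only the two columns are written
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```
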