# Using Python for Data Science:

## Introduction-

Data science is a hot field in the information technology industry, and its popularity has risen steadily over the past few years. That growth has been driven by the explosion of information over the same period.

In fact, according to Forbes Magazine:

> * Experts are predicting a 4,300 percent increase in annual data production by 2020.
> * On average, companies use only a fraction of the data they collect and store.

The information explosion has been accompanied by a need for astute business analysts who can also program and build models: data scientists. Traditionally, these individuals preferred the R programming language for their work. However, Python is quickly replacing R in this space.

We can easily visualize this growing trend of using Python for data science. For example, the popular website KDnuggets posted a graph that shows Python's growing usage, based on the keywords found in job postings:

![KDNuggets Python Graph](https://www.ibm.com/developerworks/community/blogs/jfp/resource/BLOGS_UPLOADED_IMAGES/trends0.png).

From this graph, it appears that anybody looking for a job in this growing field would benefit from gaining experience with Python, both to further their career prospects and to increase their earning potential.

Let's get to it then.

## Getting Started-

In this article we will focus on the use of Python specifically for the purpose of carrying out data science tasks such as data cleansing and building machine learning models. As such, we will skip over the more general programming concepts and reference them only when needed to develop our specific tasks. 

The main components of any data science project are:

**1) Import the Required Libraries**

**2) Load and Manipulate the Data**

**3) Build Models**

**4) Compare the Results**

These are the components that we will focus on in this article.

### Import the Required Libraries:

One of the things that makes Python so powerful is that it is a free, open-source language. As a result, many talented people have created prepackaged bundles of code that we can use in our projects so that we don't have to start from scratch. This code is usually distributed in the form of packages or libraries, which can be installed on our computer and imported into a project in a few simple steps.

In [86]:
# Import the required libraries
import os
import csv
import pandas as pd
# train_test_split lives in sklearn.model_selection;
# the older sklearn.cross_validation module has been removed
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
from sklearn.metrics import roc_curve, auc

We have just imported some libraries into Python that we can use to read our data and set up our workspace.

* `os` is a library that provides many functions for setting up a machine-independent workspace in Python
* `pandas` is a popular library used for reading and manipulating data stored in the form of "data frames"

Before we go further, let's set up our workspace by defining the working directory.

In [40]:
# Find the working directory
print(os.getcwd())

/Users/zansadiq/Documents/Code/github/Thinkful


In [41]:
# Define a new directory
path = 'path/to/files'

# Change the working directory
# os.chdir(path)
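
As a minimal sketch of the "machine-independent" idea, the snippet below builds a file path with `os.path.join` and changes directory only when the folder exists. The temporary directory is a stand-in for a real project folder, and the names `data_dir` and `train_path` are illustrative assumptions rather than part of the original workspace.

```python
import os
import tempfile

# Hypothetical project folder; a temporary directory stands in for it here
data_dir = tempfile.mkdtemp()

# os.path.join picks the right separator for the current operating system
train_path = os.path.join(data_dir, 'train.csv')

# Remember where we started, then change directory only if the folder exists
original_dir = os.getcwd()
if os.path.isdir(data_dir):
    os.chdir(data_dir)

current_dir = os.getcwd()
print(current_dir)

# Return to the starting directory so later cells are unaffected
os.chdir(original_dir)
```

Building paths this way avoids hard-coding `/` or `\`, so the same notebook can run on different operating systems.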

### Load and Manipulate the Data:

It goes without saying that in order to carry out a data science task, we must start with some data. This can come in a variety of different formats such as `.csv` or `.xlsx`, `.json`, etc. 

Data can exist locally on our machine, or it may exist somewhere on the internet and need to be downloaded before it can be read. For today's tutorial, we will go ahead and use the famous "Titanic" dataset from the Kaggle website to play around with. The files can be downloaded [*here*](https://www.kaggle.com/c/titanic/data) and they are already split into a training and testing set for us, which is convenient. 

In [42]:
# Create an empty list to store the data
train = list()

# Load the data from a local file using the csv module
with open('train.csv') as titanic_train:
    csvReader = csv.reader(titanic_train)
    for row in csvReader:
        train.append(row)

We have just used the `csv` module to load our training data into a list. A more efficient way of storing the data is in the format of a Data Frame created using the `pandas` library. 
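
For comparison, here is a small sketch of what `pd.read_csv` does with the same kind of file. The two rows are invented for illustration and are held in memory with `io.StringIO` instead of being read from `train.csv`.

```python
import io
import pandas as pd

# A tiny invented stand-in for train.csv; the quoted Name field contains a comma
raw = io.StringIO(
    'PassengerId,Survived,Name,Age\n'
    '1,0,"Doe, Mr. John",22\n'
    '2,1,"Roe, Miss. Jane",\n'
)

# read_csv uses the first row as column names and infers a dtype per column
sample = pd.read_csv(raw)

print(sample.dtypes)
print(sample.isnull().sum())
```

Note that the header row becomes the column names, the quoted comma stays inside the name, and the blank age is read as a missing value.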

We will now do two things:
* Convert our list to a dataframe
* Load the test data directly into pandas

In [43]:
# Convert the training list to a dataframe
train = pd.DataFrame(train)

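In [44]:
# Load the test data directly from a url
# (on_bad_lines replaces the older error_bad_lines argument; 'warn' skips malformed lines and reports them)
test = pd.read_csv('https://www.kaggle.com/c/3136/download/test.csv', on_bad_lines = 'warn')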

b'Skipping line 6: expected 1 fields, saw 2\nSkipping line 11: expected 1 fields, saw 8\nSkipping line 19: expected 1 fields, saw 5\nSkipping line 20: expected 1 fields, saw 5\nSkipping line 28: expected 1 fields, saw 45\nSkipping line 30: expected 1 fields, saw 2\nSkipping line 43: expected 1 fields, saw 3\nSkipping line 44: expected 1 fields, saw 3\nSkipping line 45: expected 1 fields, saw 2\nSkipping line 51: expected 1 fields, saw 7\nSkipping line 53: expected 1 fields, saw 2\nSkipping line 57: expected 1 fields, saw 5\nSkipping line 59: expected 1 fields, saw 4\nSkipping line 60: expected 1 fields, saw 2\nSkipping line 61: expected 1 fields, saw 2\nSkipping line 69: expected 1 fields, saw 6\nSkipping line 103: expected 1 fields, saw 2\nSkipping line 115: expected 1 fields, saw 2\nSkipping line 116: expected 1 fields, saw 2\nSkipping line 117: expected 1 fields, saw 2\nSkipping line 118: expected 1 fields, saw 2\nSkipping line 119: expected 1 fields, saw 2\nSkipping line 135: expec

From the output above, we can see that the simplest method of loading the data is not always ideal. It is hard to say for sure what went wrong. One possibility is that the delimiter appears within certain fields of the data. Another is that the URL did not return the raw `.csv` at all: Kaggle downloads normally require a login, in which case pandas would be parsing a web page instead of the data. Either way, we can fall back on the local copy of the file.

In [45]:
# Reload the test data using csv
test = list()

with open('test.csv') as titanic_test:
    csvReader = csv.reader(titanic_test)
    for row in csvReader:
        test.append(row)
        
# Convert to data frame
test = pd.DataFrame(test)

We have now loaded our data into Python and converted it into data frames. Let's go ahead and take a look at the result.

In [46]:
train.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S


From the above information, we can already see that we have encountered an issue with our file import. The headers have been read in as row 0 of the data instead of being used as the column names.

In [47]:
# Fix the headers
train.columns = train.iloc[0]

test.columns = test.iloc[0]

# Delete the row
train = train.reindex(train.index.drop(0))

test = test.reindex(test.index.drop(0))

# Check
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
5,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
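
The same header fix can also be applied at the moment the data frame is created. The sketch below assumes a list of rows shaped like the output of `csv.reader`, with the header first and every value stored as a string. The rows themselves are illustrative.

```python
import pandas as pd

# Illustrative rows shaped like csv.reader output: header first, all strings
rows = [
    ['PassengerId', 'Survived', 'Sex'],
    ['1', '0', 'male'],
    ['2', '1', 'female'],
]

# Use the first row as column names and the remaining rows as data
frame = pd.DataFrame(rows[1:], columns=rows[0])
print(frame)
```

With this approach the index starts at 0 and there is no leftover header row to delete.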


Sometimes, a dataset can contain more information than we actually need. In this particular situation, the columns "Ticket" and "Cabin" don't really provide any useful additional information. More importantly, they also have numerous missing values.

The missing values will cause problems for us when we try to build a model because the machine learning algorithms cannot handle these cells. 

To address these issues, we will first drop the unnecessary columns and then check the remaining data for additional missing information. 

*Note: we are also going to delete the "Name" variable because it is a string and we can already identify passengers by id.*

In [48]:
# Drop unnecessary columns (the columns argument already implies axis = 1)
train = train.drop(columns = ['Name', 'Ticket', 'Cabin'])

test = test.drop(columns = ['Name', 'Ticket', 'Cabin'])

In [49]:
# Check for missing values
train.isnull().sum()

0
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

In [50]:
# Check for missing values
test.isnull().sum()

0
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

Great, it looks like there are no additional missing values, so we can go ahead and move on.
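
One caveat is worth checking when data has been loaded with the `csv` module: blank cells arrive as empty strings, and `isnull()` does not count those. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

# Illustrative column as the csv module would return it: blanks are ''
ages = pd.DataFrame({'Age': ['22', '', '38', '']})

# Empty strings are ordinary values, so isnull() does not flag them
print(ages.isnull().sum())

# Replacing '' with NaN makes the gaps visible
ages_clean = ages.replace('', np.nan)
print(ages_clean.isnull().sum())
```

The first count is zero even though two cells are blank. The second count reports both of them.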

The next step is to make sure that each variable is encoded properly and that the data types are all appropriate.  

In [51]:
train.dtypes

0
PassengerId    object
Survived       object
Pclass         object
Sex            object
Age            object
SibSp          object
Parch          object
Fare           object
Embarked       object
dtype: object

The Kaggle competition asks us to predict whether or not passengers survived. Let's go ahead and explicitly encode our variables so that they are in the desired formats.

In [56]:
# Convert factor variables
train['Survived'] = train['Survived'].astype('category')
train['Pclass'] = train['Pclass'].astype('category')
train['Sex'] = train['Sex'].astype('category')
train['Embarked'] = train['Embarked'].astype('category')

test['Pclass'] = test['Pclass'].astype('category')
test['Sex'] = test['Sex'].astype('category')
test['Embarked'] = test['Embarked'].astype('category')

In [53]:
# Convert remaining variables to numeric
train['PassengerId'] = pd.to_numeric(train['PassengerId'])
train['Age'] = pd.to_numeric(train['Age'])
train['SibSp'] = pd.to_numeric(train['SibSp'])
train['Parch'] = pd.to_numeric(train['Parch'])
train['Fare'] = pd.to_numeric(train['Fare'])

test['PassengerId'] = pd.to_numeric(test['PassengerId'])
test['Age'] = pd.to_numeric(test['Age'])
test['SibSp'] = pd.to_numeric(test['SibSp'])
test['Parch'] = pd.to_numeric(test['Parch'])
test['Fare'] = pd.to_numeric(test['Fare'])

In [54]:
# Double check for null values
train.isnull().sum()

0
PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Embarked         0
dtype: int64

In [55]:
# Double check for null values
test.isnull().sum()

0
PassengerId     0
Pclass          0
Sex             0
Age            86
SibSp           0
Parch           0
Fare            1
Embarked        0
dtype: int64

Converting the variables from objects to numbers has exposed some missing values (`NaN`s). This can happen for several reasons. For example, those values could have been blank cells, which the `csv` module reads as empty strings rather than nulls, or typos could have introduced non-numeric characters. Whatever the case may be, we must find a way to handle these values.
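
To find out which raw values failed to convert, one option is to pass `errors='coerce'` to `pd.to_numeric` and then look up the original strings wherever the result is null. The values below are invented for illustration.

```python
import pandas as pd

# Illustrative raw values: a blank cell and a typo among valid numbers
raw_fare = pd.Series(['7.25', '', '71.28', '8.o5'])

# errors='coerce' turns anything unparseable into NaN instead of raising
fare = pd.to_numeric(raw_fare, errors='coerce')

# List the original strings that did not convert
bad_values = raw_fare[fare.isnull()].tolist()
print(bad_values)
```

Inspecting `bad_values` shows whether the gaps come from blanks, typos, or something else before deciding how to fill them.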

A common rule of thumb in the data science community is that rows with null values can be deleted if they account for less than 5% of the data. In our case the share is much higher: 177 of the 891 training rows, roughly 20%, have no age. We will therefore have to impute the missing cells. There are packages that can perform these imputations, but we will rely on a simpler method and fill the empty cells with averages.

In [58]:
# Calculate average fare and assign this to missing value in test set
test.loc[test.Fare.isnull(), 'Fare'] = test['Fare'].mean()

In [59]:
# Fill the missing age cells using pandas
# (assigning the result back avoids relying on in-place changes to a column selection)
train['Age'] = train['Age'].fillna(train['Age'].mean())

test['Age'] = test['Age'].fillna(test['Age'].mean())

In [60]:
# Check to make sure the values have been filled
train.isnull().sum()

0
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64

In [61]:
test.isnull().sum()

0
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
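
The cells above fill each dataset with its own average. One common variant, sketched here on invented data, is to learn the fill value from the training set only and reuse it for the test set, so that both sets are treated identically. The names `train_demo` and `test_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative data, not the Titanic files
train_demo = pd.DataFrame({'Age': [20.0, np.nan, 40.0, 30.0]})
test_demo = pd.DataFrame({'Age': [np.nan, 50.0]})

# Share of missing values in the training column
missing_share = train_demo['Age'].isnull().mean()

# Learn the fill value from the training data only
train_mean = train_demo['Age'].mean()

# Assign the filled columns back to each frame
train_demo['Age'] = train_demo['Age'].fillna(train_mean)
test_demo['Age'] = test_demo['Age'].fillna(train_mean)

print(missing_share, train_mean)
```

Here a quarter of the training ages are missing, and both gaps are filled with the training mean of 30.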

The next step is to split the data. However, we need to make one additional change before we can do that. The columns for "Embarked" and "Sex" have values that represent categories using letters or words. We need to convert these to numbers.

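In [73]:
# Convert the "Embarked" and "Sex" columns to integer codes.
# Each encoder is fitted on the training data and reused on the test data
# so that both sets share the same mapping.
embarked_enc = LabelEncoder()
train['Embarked'] = embarked_enc.fit_transform(train['Embarked'])
test['Embarked'] = embarked_enc.transform(test['Embarked'])

sex_enc = LabelEncoder()
train['Sex'] = sex_enc.fit_transform(train['Sex'])
test['Sex'] = sex_enc.transform(test['Sex'])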
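
The mapping that `LabelEncoder` learns depends on the values it is fitted on, which matters when the training and test columns do not contain exactly the same categories. The sketch below uses invented values in which the training column has a blank category that the test column lacks.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative values; the training column has a blank category the test lacks
train_emb = pd.Series(['S', 'C', '', 'Q'])
test_emb = pd.Series(['S', 'C', 'Q'])

# Fit once on the training data, then reuse the same mapping on the test data
shared = LabelEncoder().fit(train_emb)
test_shared = shared.transform(test_emb)

# Fitting a second encoder on the test data assigns different codes
test_separate = LabelEncoder().fit_transform(test_emb)

print(list(shared.classes_))   # ['', 'C', 'Q', 'S']
print(list(test_shared))       # [3, 1, 2]
print(list(test_separate))     # [2, 0, 1]
```

With separate fits, the code for 'S' is 3 in one dataset and 2 in the other, so a model trained on one encoding would misread the other.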

#### Splitting the Data-

We have been provided with a training set and a testing set. In some situations, we could simply use the training data to make predictions for the test set. However, it is good practice to try several models. In order to evaluate these models against one another, we will have to further split our training data to create a validation set as well.

In order to split the data appropriately, we will first separate our columns into x and y (features and target variable). This will only be done for the training data because the "Survived" column is not present in the test set. 

In [87]:
# List all of the column headers
train_vars = train.columns.values.tolist()

# Select independent variables
x_train = [i for i in train_vars if i not in ['Survived']]

# Fill the values and select the dependent variable
x = train[x_train]
y = train['Survived']

In [88]:
# Convert everything to numbers
x = x.apply(pd.to_numeric)
y = y.apply(pd.to_numeric)

In [89]:
# Split the data
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size = 0.3, random_state = 100)
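
When one class is less common than the other, as with survivors here, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both pieces. A minimal sketch on synthetic data (the arrays are invented, not the Titanic features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% in the positive class
features = np.arange(200).reshape(100, 2)
labels = np.array([1] * 30 + [0] * 70)

# stratify=labels keeps the class ratio the same in both pieces
f_train, f_val, l_train, l_val = train_test_split(
    features, labels, test_size=0.3, random_state=100, stratify=labels
)

print(f_train.shape, f_val.shape)
print(l_train.mean(), l_val.mean())
```

Fixing `random_state` makes the split reproducible, and stratifying keeps the validation scores from being skewed by an unlucky draw of classes.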

### Build Models

Finally, we are ready to begin building some models. There are many to choose from, and part of a data scientist's job is knowing which models to deploy in which scenarios. As mentioned earlier, it is usually a good idea to try several models on a given dataset, because it is rarely clear in advance which algorithm will perform best.

* In any case, our task will be to create some models and fit them to the training data.
* Then, we will compare their accuracy on the validation set.
* Finally, we will make predictions on the test set using our winning algorithm.
        
Once we have done all of this, we will add our predictions to the test set by initializing a "Survived" column and writing in the predicted values. The output will be saved to a `.csv` file which we can then upload to Kaggle in order to gauge our final outcome. 

#### Initializing the Algorithms-

There are many options in terms of packages and algorithms to use. One of the most popular libraries for data science in Python is scikit-learn (imported as `sklearn`). This library provides a convenient interface and syntax for building and deploying a variety of models.

In [90]:
# Create a Decision Tree
my_tree = DecisionTreeClassifier()
tree_fit = my_tree.fit(x_train, y_train)

In [91]:
# Create a Random Forest
rf = RandomForestClassifier()
rf_fit = rf.fit(x_train, y_train)

In [92]:
# Create a Logistic Regression
lr = LogisticRegression()
lr_fit = lr.fit(x_train, y_train)

#### Make Predictions on the Validation Set-

In [93]:
# Predict
tree_preds = tree_fit.predict(x_val)
rf_preds = rf_fit.predict(x_val)
lr_preds = lr.predict(x_val)
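
The fit-and-predict steps above can also be written as a loop over a dictionary of models. The sketch below runs on synthetic data from `make_classification` instead of the Titanic features, and the `random_state` and `max_iter` settings are assumptions added to make repeated runs consistent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=100)

# random_state fixes the randomness so repeated runs give the same scores
models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(random_state=0),
    'logistic regression': LogisticRegression(max_iter=1000),
}

# Fit each model on the training part and score it on the validation part
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_va, model.predict(X_va))

best = max(scores, key=scores.get)
print(scores)
print('Best on validation:', best)
```

Keeping the models in one dictionary makes it easy to add or remove candidates without repeating the fit and predict code.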

### Compare the Results:

#### Assess the Models-

Now that we have made our models and predictions, it is time to assess them. This can be done in one of several ways.

This is a binary classification problem: the target variable takes one of two values, 0 or 1.

A common means of assessing a binary classifier is the ROC curve. This is a plot of sensitivity against 1 - specificity (the false positive rate) as the decision threshold varies.

* **Sensitivity**: True Positive Rate
* **Specificity**: True Negative Rate

The main point is that, for a given ROC curve, a larger "area under the curve" (AUC) indicates a better model. Alternatively, we can also use plain accuracy.
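
To make the two rates concrete, here is a small worked example that computes them from a confusion matrix. The labels and predictions are invented for illustration and are not taken from the Titanic run.

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels and predictions
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

sensitivity = tp / (tp + fn)           # true positive rate: 3 / 4
specificity = tn / (tn + fp)           # true negative rate: 4 / 6
false_positive_rate = 1 - specificity  # 2 / 6

print(sensitivity, specificity, false_positive_rate)
```

Each point on an ROC curve is one such pair of true positive rate and false positive rate, computed at a particular decision threshold.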

In [94]:
# Accuracy
print("Validation set, decision tree accuracy:", accuracy_score(y_val, tree_preds))
print("Validation set, random forest accuracy:", accuracy_score(y_val, rf_preds))
print("Validation set, logistic regression accuracy:", accuracy_score(y_val, lr_preds))

Validation set, decision tree accuracy: 0.7574626865671642
Validation set, random forest accuracy: 0.8171641791044776
Validation set, logistic regression accuracy: 0.7835820895522388


It would appear that the Random Forest model is the most accurate. Let's also take a look at the AUC values for each model to confirm:

In [95]:
# ROC Curve: Decision Tree
fpr, tpr, _ = roc_curve(y_val, tree_preds)
tree_roc_auc = auc(fpr, tpr)
print("The AUC for the Decision Tree is", tree_roc_auc)

The AUC for the Decision Tree is 0.7465524205181467


In [97]:
# ROC Curve: Random Forest
fpr, tpr, _ = roc_curve(y_val, rf_preds)
rf_roc_auc = auc(fpr, tpr)
print("The AUC for the Random Forest is", rf_roc_auc)

The AUC for the Random Forest is 0.7954243840516992


In [98]:
# ROC Curve: Logistic Regression
fpr, tpr, _ = roc_curve(y_val, lr_preds)
lr_roc_auc = auc(fpr, tpr)
print("The AUC for the Logistic Regression is", lr_roc_auc)

The AUC for the Logistic Regression is 0.7613524897582368
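
The AUC values above are computed from hard 0/1 predictions, which gives each ROC curve only one interior point. A common alternative is to pass predicted probabilities, so the curve is traced across every threshold. The sketch below shows both on synthetic data. It uses `roc_auc_score`, a scikit-learn helper that the notebook itself does not import.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=100)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: the curve has a single interior point
auc_labels = roc_auc_score(y_va, clf.predict(X_va))

# AUC from the predicted probability of class 1: the curve uses every threshold
auc_scores = roc_auc_score(y_va, clf.predict_proba(X_va)[:, 1])

print(auc_labels, auc_scores)
```

The two numbers generally differ, and the probability-based value is the one usually meant when people report AUC.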


## Conclusion

Based upon our results, the most reliable model appears to be the random forest. We can now upload our predictions on the test set to Kaggle. This will give us a final assessment of our model, because Kaggle scores the predictions for the unlabeled test set and reports our final accuracy.

#### Write the Results to CSV-

In [99]:
# Fill a new column in the test data for the predictions
test['Survived'] = rf_fit.predict(test)

In [103]:
# Create a new df for the results
out = pd.DataFrame(test['PassengerId'])
out['Survived'] = test['Survived']

There are two ways that we can make the final output: using pandas or base python code. Both formats are shown below.

In [None]:
# Set index to PassengerId for submission
out.set_index('PassengerId', inplace = True)

In [None]:
# Use pandas to create the file
out.to_csv('kaggle_submission.csv')

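In [104]:
# Define the file name and header
csv_header = ['PassengerId', 'Survived']
file_name = 'kaggle_submission.csv'

# Function to create the csv with the built-in csv module
def print_results(file_name, csv_header, data):
    # If PassengerId was moved into the index above, restore it as a column
    if data.index.name == 'PassengerId':
        data = data.reset_index()
    with open(file_name, 'w', newline = '') as f:
        writer = csv.writer(f)
        writer.writerow(csv_header)
        # Write one row per passenger
        for row in data[csv_header].itertuples(index = False):
            writer.writerow(row)

# Call function
print_results(file_name, csv_header, out)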
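
Before uploading, it can help to confirm that the file has exactly the two expected columns and no stray index column. The sketch below writes an invented three-row submission to an in-memory buffer with `index=False` and reads it back. The values are illustrative, not real predictions.

```python
import io
import pandas as pd

# Illustrative predictions, not real model output
submission = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})

# index=False keeps the row index out of the file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
text = buffer.getvalue()
header = text.splitlines()[0]

# Read the text back to confirm it round-trips to the same table
check = pd.read_csv(io.StringIO(text))

print(header)
print(check.shape)
```

If `PassengerId` is kept as a regular column, `index=False` is an alternative to calling `set_index` before writing.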

#### Kaggle Submission

Our random forest scores about .74 on Kaggle, somewhat below its accuracy of roughly .82 on the validation data. Feel free to try some other models at home, or use feature engineering/selection to see if you can improve on our final score. Good luck!

![Kaggle Submission](kaggle_submission.png)