### __Title__: Introduction to Machine Learning (ML) using Python __[Mini-Project](https://clemsonciti.github.io/Workshop-Python-ML/15-Mini-Project/index.html)__

&nbsp;

##### __Purpose__: To evaluate the ability of the user (`madonay`) to work through a sample data science project from scratch
##### __Date__: 20210629
##### __Author__: Maria E. Adonay (`madonay`)

&nbsp;

##### __Note__: The project is about Supervised ML only and involves the following:

&nbsp;

> #####       - Downloading data
> #####       - Cleaning data
> #####       - Splitting data into training / testing
> #####       - Applying a machine learning model
> #####       - Analyzing the output

&nbsp;

##### __Data__: The __[Titanic data](https://github.com/clemsonciti/Workshop-Python-ML/tree/master/data/Titanic_data)__ will be used for this project. The columns may be summarized as follows:

&nbsp;

> #####       - Column 1: `PassengerId` - ID number
> #####       - Column 2: `Survived` - Indication of survival (0: No; 1: Yes)
> #####       - Column 3: `Pclass` - Ticket class (1: 1st; 2: 2nd; 3: 3rd)
> #####       - Column 4: `Name` - Name
> #####       - Column 5: `Sex` - Sex (male, female)
> #####       - Column 6: `Age` - Age (years)
> #####       - Column 7: `SibSp` - Number of siblings / spouses aboard
> #####       - Column 8: `Parch` - Number of parents / children aboard
> #####       - Column 9: `Ticket` - Ticket number
> #####       - Column 10: `Fare` - Price for transport
> #####       - Column 11: `Cabin` - Cabin number
> #####       - Column 12: `Embarked` - Port of initial boarding (C: Cherbourg; Q: Queenstown; S: Southampton)

&nbsp;

##### _Source_: __[Department of Biostatistics at Vanderbilt University](https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic.html)__ via __[Clemson CITI](http://citi.sites.clemson.edu/)__ __[Titanic_data GitHub Repository](https://github.com/clemsonciti/Workshop-Python-ML/tree/master/data/Titanic_data)__
##### _Predictors_: `Pclass`, `Age`, `SibSp`, `Parch`
##### _Predictand_: `Survived`
##### _ML Model Output Type_: (Random Forest) Classification

&nbsp;

##### For more information: __https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3info.txt__

&nbsp;

***

&nbsp;

In [1]:
# 0/8: Import libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [2]:
# 1/8: Read in the data

url_train = 'https://raw.githubusercontent.com/clemsonciti/Workshop-Python-ML/master/data/Titanic_data/train.csv'
train = pd.read_csv(url_train)  # read_csv already returns a DataFrame

#train.head()

Note: We are treating the `train` dataset as the "entire" dataset because it incorporates the `Survived` column and allows for practicing splitting data into "test" and "train" in Part 3/8, below.

In [3]:
# 2/8: Clean and standardize the input data

# Keep only the target and the four predictors by dropping the unused columns
data = train.drop(columns=["PassengerId", "Name", "Sex", "Ticket", "Fare", "Cabin", "Embarked"])

# Turn empty strings into real missing values (not the text "NaN"), then drop incomplete rows
data = data.replace("", float("nan"))
data = data.dropna()

#print(data.count())
#print(data.head())
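The cleaning step can be sketched on a small made-up table. The rows below are hypothetical and not taken from the Titanic data; the sketch only illustrates that `dropna()` removes rows holding real missing values, while the text `"NaN"` is treated as an ordinary value and is kept.

```python
import pandas as pd

# Hypothetical toy rows, not from the Titanic data
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, None, 26.0, None],
    "Sex": ["male", "female", "female", "male"],
})

# Keep only the target and some predictors, then drop rows with missing values
keep = ["Survived", "Pclass", "Age"]
clean = toy[keep].dropna()

# The string "NaN" is an ordinary value, so dropna() leaves that row in place
as_text = pd.DataFrame({"Age": ["22", "NaN"]})
n_rows_text = len(as_text.dropna())

print(len(clean), n_rows_text)
```

Checking `data.isna().sum()` before and after cleaning is one way to confirm how many rows each column would cost.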

In [4]:
# 3/8: Split data into training / testing

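# First split: hold out 40% of the cleaned rows as data_test (not used again below)
data_train, data_test = train_test_split(data, train_size=0.6, random_state=123)

# Separate the predictors (X) from the target (y) within the remaining 60%
X = data_train.drop(['Survived'], axis=1).values
y = data_train['Survived'].values

# Second split: 60% of data_train for fitting and 40% for testing,
# so the model is fit on about 36% of the cleaned rows
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=123)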

#print(data.shape)
#print(data_train.shape)
#print(data_test.shape)

#print(X_train.shape)
#print(X_test.shape)
#print(y_train.shape)
#print(y_test.shape)
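The effect of calling `train_test_split` twice with `train_size=0.6` can be sketched on made-up data. The arrays below are hypothetical placeholders with 100 rows, chosen only to make the resulting sizes easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical rows with 4 features each
X = np.arange(400).reshape(100, 4)
y = np.arange(100) % 2

# Single split: 60 rows to train on, 40 rows to test on
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=123)

# Splitting the training part a second time leaves 36 rows for fitting
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_tr, y_tr, train_size=0.6, random_state=123)

print(X_tr.shape, X_te.shape, X_tr2.shape, X_te2.shape)
```

A single split would leave more rows for fitting; whether that changes the score on the Titanic data is not tested here.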

In [5]:
# 4/8: Perform regularization
# Not necessary: no regularization is applied to the Random Forest used below

In [6]:
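# 5/8: Fit ML model to the training set and explain why the algorithm should be used

# 20 trees with Gini impurity as the split criterion; no random_state is set, so results vary between runs
model_RF = RandomForestClassifier(n_estimators=20, criterion="gini").fit(X_train, y_train)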

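A Random Forest trains many decision trees, each on a bootstrap sample of the training data, and combines their predictions. Averaging over many trees reduces the overfitting that a single decision tree is prone to, which usually yields more accurate predictions. It also handles numeric predictors such as `Pclass`, `Age`, `SibSp`, and `Parch` without any scaling. This is why the Random Forest method is a reasonable choice among the Supervised Machine Learning methods for this project.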
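As a rough illustration of combining trees, the sketch below fits a single decision tree and a 20-tree forest on synthetic data from `make_classification`. The data, the `random_state` values, and the variable names are assumptions made for this example only; the scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# One tree versus an ensemble of 20 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_train, y_train)

# score() returns the mean accuracy on the given test data
tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"single tree: {tree_acc:.3f}, forest of {len(forest.estimators_)} trees: {forest_acc:.3f}")
```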

In [7]:
# 6/8: Apply ML model to predict the output from the testing set

y_pred_RF = model_RF.predict(X_test)

In [8]:
# 7/8: Evaluate the output using a method from "Chapter 4"

metrics.accuracy_score(y_test, y_pred_RF)

0.6395348837209303
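An accuracy score is easier to judge next to a simple baseline and a confusion matrix. The sketch below uses hypothetical labels and predictions (not the model output above) to show one way of computing both with `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

accuracy = metrics.accuracy_score(y_true, y_hat)

# Baseline: always predict the most common class in y_true
majority = np.bincount(y_true).argmax()
baseline = metrics.accuracy_score(y_true, np.full_like(y_true, majority))

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_hat)

print(accuracy, baseline)
print(cm)
```

In this toy case the classifier is no better than always predicting the majority class, which the accuracy score alone would not reveal.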

##### 8/8: Assess whether the ML model is good or bad

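This implementation of the Random Forest method does not perform very well: the accuracy score above shows that only about 64% of the test passengers were classified correctly. This is likely due to the limited training data: only the `train` file was used, rows with missing values were dropped, the data was split twice in Part 3/8, and several columns were excluded by the user. Because no `random_state` was passed to `RandomForestClassifier`, the exact score will also vary between runs. The accuracy could likely be improved by incorporating more data with more variety, for example the excluded `Sex`, `Fare`, and `Embarked` columns. However, this would require more upfront processing, such as encoding text columns as numbers, so that the model can interpret them.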
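One possible form of that upfront processing is one-hot encoding of the text columns. The sketch below applies `pd.get_dummies` to a few hypothetical rows (not taken from the Titanic data); it is an assumption about how the excluded columns could be prepared, not a step performed in this project.

```python
import pandas as pd

# Hypothetical toy rows, not from the Titanic data
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One-hot encode the text columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"])

print(list(encoded.columns))
```

The resulting columns contain only numeric or boolean values, which `RandomForestClassifier` can accept as input.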