- In this project, we are given Train.csv, which contains features for each passenger (age, ticket fare, sex, name, etc.) together with a label indicating whether they survived.
- We then have to predict, for each passenger in the test set, whether they survived, given the same features.
- We will use decision trees for this, combined into a random forest.

### Data Preprocessing:

In [1]:
import pandas as pd, numpy as np

In [4]:
df = pd.read_csv("./Train/Train.csv")
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,3.0,0.0,"O'Donoghue, Ms. Bridget",female,,0.0,0.0,364856,7.75,,Q,,,
1,2.0,0.0,"Morley, Mr. Henry Samuel (""Mr Henry Marshall"")",male,39.0,0.0,0.0,250655,26.0,,S,,,
2,2.0,1.0,"Smith, Miss. Marion Elsie",female,40.0,0.0,0.0,31418,13.0,,S,9,,
3,3.0,1.0,"Goldsmith, Mrs. Frank John (Emily Alice Brown)",female,31.0,1.0,1.0,363291,20.525,,S,C D,,"Strood, Kent, England Detroit, MI"
4,3.0,1.0,"McCoy, Miss. Agnes",female,,2.0,0.0,367226,23.25,,Q,16,,


- We are given several features, and some of them are unlikely to help: a passenger's name, for example, should have little to do with whether they survived. Age matters because children may have been given priority during the evacuation. Women were likewise given priority, so 'sex' is relevant. Passengers with a higher ticket fare may also have been given priority, and passenger class works the same way as fare: the higher the class, the more priority.
- So we will do some feature selection by removing some features.
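
One quick way to sanity-check these intuitions is to compare survival rates across groups. The sketch below uses only the five rows printed by df.head() above, so the numbers are purely illustrative and say nothing about the full dataset; the name `sample` is a placeholder.

```python
import pandas as pd

# The five rows printed by df.head() above (only the columns needed here).
sample = pd.DataFrame({
    "pclass":   [3.0, 2.0, 2.0, 3.0, 3.0],
    "sex":      ["female", "male", "female", "female", "female"],
    "survived": [0.0, 0.0, 1.0, 1.0, 1.0],
})

# The mean of a 0/1 label within each group is that group's survival rate.
rate_by_sex = sample.groupby("sex")["survived"].mean()
rate_by_class = sample.groupby("pclass")["survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

Running the same two groupby calls on the full training frame would show whether sex and pclass separate survivors as strongly as expected.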

In [5]:
# pclass:          Ticket class   1 = 1st, 2 = 2nd, 3 = 3rd

# sex:              Sex   

# age:                Age in years   

# sibsp:        # of siblings / spouses aboard the Titanic  

# parch:       # of parents / children aboard the Titanic  

# ticket:        Ticket number   

# fare:         Passenger fare  

# cabin:         Cabin number   

# embarked:    Port of Embarkation=>    C = Cherbourg, Q = Queenstown, S = Southampton

- The number of siblings and parents can also matter: the more family members aboard, the more they may have helped each other. So for now I will drop only a few columns. You can also check the accuracy with the dropped columns included, because sometimes a factor we did not take into account plays a major role.

In [6]:
df = df.drop(["name","cabin","embarked","home.dest","body","ticket"], axis=1)  ## boat is kept for now

- The ticket number is also not relevant. In the standard Titanic dataset, body is the body identification number and boat is the lifeboat a passenger boarded. We remove body here and keep the boat column for now.

#### Convert string data to numeric data
- Here the sex and boat columns hold string values. We can convert them to numeric values with the sklearn.preprocessing.LabelEncoder class, but for now let me do it by hand.

In [10]:
df["sex"]=="female"

0        True
1       False
2        True
3        True
4        True
        ...  
1004    False
1005     True
1006     True
1007    False
1008    False
Name: sex, Length: 1009, dtype: bool

In [11]:
sex_mapping = {}  ## dictionary mapping each unique string value to an integer

i = 0
for value in np.unique(df["sex"]):
    sex_mapping[value] = i  ## key is the string, value is its numeric code

    # replace every occurrence of this value with i; a single df.loc call
    # avoids chained assignment and the SettingWithCopyWarning it triggers
    df.loc[df["sex"] == value, "sex"] = i
    i += 1


In [12]:
print(sex_mapping)
print(df["sex"])

{'female': 0, 'male': 1}
0       0
1       1
2       0
3       0
4       0
       ..
1004    1
1005    0
1006    0
1007    1
1008    1
Name: sex, Length: 1009, dtype: object


- We have converted the column to numeric data.
- We can wrap this in a general function.

- In the "boat" column, a single column holds different data types (strings and NaN), so np.unique() raises an error. So we use **astype("str") to convert every value to a string and then proceed as before**.
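
A small sketch of the mixed-type problem, using a hypothetical five-value series rather than the real column. It takes a slightly different route from the function that follows: it drops NaN first and then converts to strings, so no "nan" label has to be skipped afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "boat" column: strings mixed with NaN (a float).
boat = pd.Series(["9", np.nan, "C D", "16", np.nan], dtype="object")

# Sorting strings against a float can make np.unique fail on the raw column.
try:
    np.unique(boat)
    raw_unique_failed = False
except TypeError:
    raw_unique_failed = True

# Dropping NaN and casting to str leaves a single comparable type.
labels = np.unique(boat.dropna().astype("str"))
mapping = {label: code for code, label in enumerate(labels)}
print(raw_unique_failed, mapping)
```

Note that the labels are sorted as strings, so "16" comes before "9"; the codes are arbitrary identifiers, not an ordering of the boats.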

In [15]:
def str_to_numeric(df, col_name):
    mapping = {}  ## maps each unique string value to an integer

    as_str = df[col_name].astype("str")  ## string version of the column, computed once
    i = 0
    for value in np.unique(as_str):
        if value.lower() == "nan":  ## skip missing values (NaN, Nan, nan)
            continue
        mapping[value] = i  ## key is the string, value is its numeric code

        # replace every occurrence of this value with i, without chained assignment
        df.loc[as_str == value, col_name] = i

        i += 1
    return mapping  ## df is modified in place, so only the mapping is returned

In [16]:
mapping = str_to_numeric(df, "boat")
mapping

{'1': 0,
 '10': 1,
 '11': 2,
 '12': 3,
 '13': 4,
 '13 15': 5,
 '13 15 B': 6,
 '14': 7,
 '15': 8,
 '16': 9,
 '2': 10,
 '3': 11,
 '4': 12,
 '5': 13,
 '5 7': 14,
 '5 9': 15,
 '6': 16,
 '7': 17,
 '8': 18,
 '9': 19,
 'A': 20,
 'B': 21,
 'C': 22,
 'C D': 23,
 'D': 24}

In [17]:
df

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,boat
0,3.0,0.0,0,,0.0,0.0,7.7500,
1,2.0,0.0,1,39.0,0.0,0.0,26.0000,
2,2.0,1.0,0,40.0,0.0,0.0,13.0000,19
3,3.0,1.0,0,31.0,1.0,1.0,20.5250,23
4,3.0,1.0,0,,2.0,0.0,23.2500,9
...,...,...,...,...,...,...,...,...
1004,1.0,1.0,1,40.0,0.0,0.0,31.0000,17
1005,3.0,0.0,0,37.0,0.0,0.0,9.5875,
1006,1.0,1.0,0,23.0,1.0,0.0,113.2750,16
1007,3.0,1.0,1,12.0,1.0,0.0,11.2417,22


- The whole process above can also be done in a single step by creating a **LabelEncoder** object.
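
A minimal sketch of the LabelEncoder route, assuming a toy column in place of the original df["sex"]. The variable names are placeholders. A column that contains NaN, such as boat, may still need its missing values handled before encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; in the notebook this would be df["sex"] before any conversion.
sex = pd.Series(["female", "male", "female", "female", "male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)  # classes are sorted, so female -> 0, male -> 1

# Rebuild the same kind of dictionary that sex_mapping holds.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(codes, mapping)
```

The fitted encoder can later be reused with encoder.transform(...) on the test column, which keeps the train and test codes consistent.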

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1009 entries, 0 to 1008
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1009 non-null   float64
 1   survived  1009 non-null   float64
 2   sex       1009 non-null   object 
 3   age       812 non-null    float64
 4   sibsp     1009 non-null   float64
 5   parch     1009 non-null   float64
 6   fare      1008 non-null   float64
 7   boat      374 non-null    object 
dtypes: float64(6), object(2)
memory usage: 63.2+ KB


- We can see that boat has only 374 non-null values out of 1009; all the others are NaN. Filling the remaining 635 rows with the mean of these 374 would put most of the column at the mean value and pull predictions toward it. So let us drop this column as well.
- We could have dropped it at the beginning, but my aim was to show the string-to-numeric conversion on this column. That's why I didn't delete it earlier.
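
The same check can be done numerically instead of reading it off df.info(). A sketch on a made-up four-row frame; the column names mirror the notebook, while the values and the 0.5 threshold are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "age":  [39.0, np.nan, 40.0, 31.0],
    "boat": [np.nan, np.nan, "9", np.nan],
})

# isna() marks missing cells; the column mean of those booleans is the missing fraction.
missing_fraction = toy.isna().mean()

# Drop columns where more than half of the values are missing (threshold is arbitrary).
mostly_empty = missing_fraction[missing_fraction > 0.5].index
toy = toy.drop(columns=mostly_empty)
print(missing_fraction)
print(toy.columns.tolist())
```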

In [23]:
df = df.drop("boat", axis=1)

#### Fill NaN values now

In [25]:
df["age"] = df["age"].fillna(df["age"].mean())   ## fill mean for all NaN values
df["fare"] = df["fare"].fillna(0)  ## fill 0 in this case as there is only 1 NaN value here
df.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare
0,3.0,0.0,0,29.838978,0.0,0.0,7.75
1,2.0,0.0,1,39.0,0.0,0.0,26.0
2,2.0,1.0,0,40.0,0.0,0.0,13.0
3,3.0,1.0,0,31.0,1.0,1.0,20.525
4,3.0,1.0,0,29.838978,2.0,0.0,23.25


In [27]:
df.columns

Index(['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'fare'], dtype='object')

In [35]:
## divide in X_train and y_train
X_train = df[["pclass", "sex", "age", "sibsp", "parch", "fare"]] ## pass list of columns in index
y_train = df["survived"]  ## you can pass col_name directly instead of list for single column
X_train.shape, y_train.shape

((1009, 6), (1009,))

#### Data preparation complete! Train model now

## Training
- We will train a random forest, which trains many decision trees and combines their results. A single tree may sometimes give poor results while another gives better ones, so we train many trees and average their predictions.
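
To make the averaging concrete, here is a minimal sketch on synthetic data (not the Titanic features). In scikit-learn, the forest's class probabilities are the mean of the per-tree probabilities, which the code checks by hand. The data, the seed and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data so the sketch is self-contained; not the Titanic features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=20, criterion="entropy",
                                max_depth=7, random_state=0)
forest.fit(X, y)

# Each fitted tree is available in estimators_; average their class probabilities.
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
forest_probs = forest.predict_proba(X)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[np.argmax(tree_probs, axis=1)]
print(np.allclose(tree_probs, forest_probs))
```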

In [58]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=20,criterion='entropy',max_depth=7) # create object

In [59]:
rf.fit(X_train,y_train)  ## do the training 

RandomForestClassifier(criterion='entropy', max_depth=7, n_estimators=20)

In [60]:
rf.score(X_train, y_train)

0.8592666005946482
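
The score above is computed on the same rows the model was trained on, so it tends to be optimistic. A minimal sketch of holding out part of the labelled data for validation follows. It uses synthetic data as a stand-in for X_train and y_train, and the 20% split and the seeds are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train / y_train so the sketch runs on its own.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

# Hold out 20% of the labelled rows; the split ratio and seed are arbitrary choices.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=20, criterion="entropy",
                               max_depth=7, random_state=0)
model.fit(X_fit, y_fit)

fit_accuracy = model.score(X_fit, y_fit)  # accuracy on rows the model has seen
val_accuracy = model.score(X_val, y_val)  # accuracy on held-out rows
print(fit_accuracy, val_accuracy)
```

The held-out accuracy is usually the better guide when tuning n_estimators or max_depth, since the test labels are not available locally.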

## Testing

In [61]:
df2 = pd.read_csv("./Test/Test.csv")
df2.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,"Flynn, Mr. John Irwin (""Irving"")",male,36.0,0.0,0.0,PC 17474,26.3875,E25,S,5.0,,"Brooklyn, NY"
1,3.0,"Sage, Miss. Constance Gladys",female,,8.0,2.0,CA. 2343,69.55,,S,,,
2,1.0,"Rood, Mr. Hugh Roscoe",male,,0.0,0.0,113767,50.0,A32,S,,,"Seattle, WA"
3,2.0,"Gillespie, Mr. William Henry",male,34.0,0.0,0.0,12233,13.0,,S,,,"Vancouver, BC"
4,2.0,"Collander, Mr. Erik Gustaf",male,28.0,0.0,0.0,248740,13.0,,S,,,"Helsinki, Finland Ashtabula, Ohio"


In [62]:
## drop some columns to match with X_train
df2 = df2.drop(["name","cabin","embarked","home.dest","body","ticket", "boat"], axis=1)

In [66]:
## now convert the sex column to numeric data with the same mapping used for the training data
for k in sex_mapping.keys():
    df2.loc[df2["sex"] == k, "sex"] = sex_mapping[k]


In [68]:
df2.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare
0,1.0,1,36.0,0.0,0.0,26.3875
1,3.0,0,,8.0,2.0,69.55
2,1.0,1,,0.0,0.0,50.0
3,2.0,1,34.0,0.0,0.0,13.0
4,2.0,1,28.0,0.0,0.0,13.0
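
An alternative sketch for applying a training-time mapping to test data with Series.map, which also exposes categories that never appeared during training. The toy column and the "unknown" value are hypothetical.

```python
import pandas as pd

# Mapping learned on the training data (same values as sex_mapping above).
sex_mapping = {"female": 0, "male": 1}

# Toy test column; "unknown" stands for a category never seen during training.
test_sex = pd.Series(["male", "female", "unknown"])

encoded = test_sex.map(sex_mapping)   # unseen categories become NaN
unseen = test_sex[encoded.isna()]     # rows that need attention before predicting
print(encoded.tolist(), unseen.tolist())
```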


In [69]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pclass  300 non-null    float64
 1   sex     300 non-null    object 
 2   age     234 non-null    float64
 3   sibsp   300 non-null    float64
 4   parch   300 non-null    float64
 5   fare    300 non-null    float64
dtypes: float64(5), object(1)
memory usage: 14.2+ KB


In [70]:
## fill NaN values in age column as average values
df2["age"] = df2["age"].fillna(df2["age"].mean())
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pclass  300 non-null    float64
 1   sex     300 non-null    object 
 2   age     300 non-null    float64
 3   sibsp   300 non-null    float64
 4   parch   300 non-null    float64
 5   fare    300 non-null    float64
dtypes: float64(5), object(1)
memory usage: 14.2+ KB


- We can't check the score on the test data because we do not have labels for it. So let us generate predictions, submit them for the assignment, and see the test accuracy there.

In [73]:
y_pred = rf.predict(df2)
y_pred

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 0., 0.,
       0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0.,
       0., 1., 1., 0., 1., 0., 1., 0., 1., 1., 0., 0., 1., 0., 0., 1., 0.,
       0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0.,
       0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1., 1.,
       0., 0., 1., 1., 0.

In [75]:
results = pd.DataFrame(y_pred, columns=["survived"]) # convert to dataframe to convert to csv file
results.to_csv("Output.csv", index_label="Id")

### Got 80% accuracy on testing data. That is a good accuracy for a random forest of decision trees
- Try adding other features as well. They may increase accuracy.