# DECISION TREE

Using the Titanic data, in your classification-exercises repository, create a notebook, model.ipynb, where you will do the following:

In [33]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

from pydataset import data

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix 


import matplotlib.pyplot as plt
import seaborn as sns

import acquire
import prepare

## STEP 1: Plan 
 - Let's examine the Titanic dataset.
 - Can we accurately predict the survival of passengers on the Titanic based on passenger attributes such as age, gender, passenger class, or fare?

## STEP 2: Acquire
 - Acquire the data we have cleaned and prepped using our previous functions.

In [2]:
df = acquire.get_titanic_data()

df.head()

Reading from csv file...


Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


## STEP 3: Prepare

In [3]:
df = prepare.prep_titanic(df)

In [4]:
df.head()

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,embark_town,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,male,1,0,7.25,Southampton,0,1,0,1
1,1,1,female,1,0,71.2833,Cherbourg,0,0,0,0
2,1,3,female,0,0,7.925,Southampton,1,0,0,1
3,1,1,female,1,0,53.1,Southampton,0,0,0,1
4,0,3,male,0,0,8.05,Southampton,1,1,0,1


In [5]:
df = df.drop(columns=["sex", "embark_town"])
df.head()

Unnamed: 0,survived,pclass,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,1,0,7.25,0,1,0,1
1,1,1,1,0,71.2833,0,0,0,0
2,1,3,0,0,7.925,1,0,0,1
3,1,1,1,0,53.1,0,0,0,1
4,0,3,0,0,8.05,1,1,0,1


#### Prepare - Split Data

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
def train_validate_test_split(df, target, seed=123):
    train_validate, test = train_test_split(df, test_size=0.2, 
                                            random_state=seed, 
                                            stratify=df[target])
    train, validate = train_test_split(train_validate, test_size=0.3, 
                                       random_state=seed,
                                       stratify=train_validate[target])
    print(f'train --> {train.shape}')
    print(f'validate --> {validate.shape}')
    print(f'test --> {test.shape}')
    
    return train, validate, test

In [8]:
train, validate, test = train_validate_test_split(df, 'survived', seed=123)

train --> (498, 9)
validate --> (214, 9)
test --> (179, 9)
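
The shapes printed above follow from the two nested splits: 20% of the rows are held out for test, then 30% of the remainder goes to validate, which leaves roughly 56% / 24% / 20%. Below is a small arithmetic sketch of that. It assumes the held-out share is rounded up to a whole row, which is consistent with the printed shapes, and it uses 891 as the row count implied by those shapes. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def split_sizes(n_rows, test_size=0.2, validate_size=0.3):
    """Return (train, validate, test) row counts for two nested splits.

    Assumption: each held-out share is rounded up to a whole row.
    """
    n_test = math.ceil(n_rows * test_size)
    n_train_validate = n_rows - n_test
    n_validate = math.ceil(n_train_validate * validate_size)
    n_train = n_train_validate - n_validate
    return n_train, n_validate, n_test

sizes = split_sizes(891)
print(sizes)
```

Because `validate_size` applies to what is left after the test split, the validate share of the full dataset is about 0.8 * 0.3 = 0.24, not 0.3.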


In [9]:
train.shape

(498, 9)

In [10]:
validate.shape

(214, 9)

In [11]:
test.shape

(179, 9)

In [12]:
train, validate, test = train_validate_test_split(df, target='survived', seed=123)

X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

train --> (498, 9)
validate --> (214, 9)
test --> (179, 9)


In [13]:
X_train.head()

Unnamed: 0,pclass,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
583,1,0,0,40.125,1,1,0,0
165,3,0,2,20.525,0,1,0,1
50,3,4,1,39.6875,0,1,0,1
259,2,0,1,26.0,0,0,0,1
306,1,0,0,110.8833,1,0,0,0


In [14]:
X_train.shape

(498, 8)

In [15]:
X_validate.shape

(214, 8)

In [16]:
X_test.shape

(179, 8)

In [17]:
y_train.value_counts()

0    307
1    191
Name: survived, dtype: int64

## STEP 4: EXPLORATION / PRE-PROCESSING
- Done previously

## STEP 5: Modeling

### Baseline

1.  a. What is your baseline prediction? 

    b. What is your baseline accuracy? 
    - Remember: your baseline prediction for a classification problem is predicting the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

In [18]:
y_train[0:10]

583    0
165    1
50     0
259    1
306    1
308    0
314    0
883    0
459    0
180    0
Name: survived, dtype: int64

In [19]:
# obtain our mode (most frequently occurring outcome)
train.survived.value_counts()

# baseline assumption = Did NOT survive

0    307
1    191
Name: survived, dtype: int64

In [20]:
# Obtain the mode for the target (mode() returns a Series, so take its first value)
baseline = y_train.mode()[0]

# Boolean Series: True where the actual value matches the baseline prediction
matches_baseline_prediction = (y_train == baseline)

baseline_accuracy = matches_baseline_prediction.mean()

print(f'Baseline Accuracy: {round(baseline_accuracy, 2)}')

Baseline Accuracy: 0.62
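
The same baseline can be reproduced from the class counts alone, which makes the arithmetic explicit. This is a minimal sketch that rebuilds the target from the counts reported by `value_counts()` (307 and 191); the list `y_train_values` is a stand-in for the real `y_train`.

```python
from collections import Counter

# Class counts taken from y_train.value_counts() in the notebook
y_train_values = [0] * 307 + [1] * 191

counts = Counter(y_train_values)
# most_common(1) returns [(class, count)] for the most frequent class
baseline_class, baseline_hits = counts.most_common(1)[0]
baseline_accuracy = baseline_hits / len(y_train_values)

print(f"Baseline prediction: {baseline_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```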


### Fit - Transform
2. Fit the decision tree classifier to your training sample and transform 
- (i.e. make predictions on the training sample)

In [21]:
# Make the model
clf1 = DecisionTreeClassifier(max_depth=1, random_state=123)

#Fit the model (on train and only train)
clf1 = clf1.fit(X_train, y_train)

# Use the model
# We'll evaluate the model's performance on train first

y_predictions = clf1.predict(X_train)


In [22]:
print(export_text(clf1, feature_names=X_train.columns.tolist()))

|--- sex_male <= 0.50
|   |--- class: 1
|--- sex_male >  0.50
|   |--- class: 0
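
Read as plain logic, the depth-1 tree printed above is a single rule on `sex_male`. The sketch below mirrors that rule in a hypothetical helper, `stump_predict`, applied to two made-up passengers. It is an illustration of what the exported text means, not a replacement for `clf1.predict`.

```python
def stump_predict(sex_male):
    """Mirror the depth-1 tree: class 1 (survived) when sex_male <= 0.5, else class 0."""
    return 1 if sex_male <= 0.5 else 0

# Two made-up passengers for illustration
passengers = [
    {"sex_male": 0, "pclass": 1},
    {"sex_male": 1, "pclass": 3},
]
predictions = [stump_predict(p["sex_male"]) for p in passengers]
print(predictions)
```

Every other feature is ignored at this depth, which is why the model amounts to "predict survival for women, non-survival for men".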



### Evaluate Performance

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [23]:
tree_score = clf1.score(X_train, y_train)
print(f'Accuracy of Decision Tree Classifier on training set: {tree_score:.2f}')

Accuracy of Decision Tree Classifier on training set: 0.80


In [24]:
pd.DataFrame(confusion_matrix(y_train, y_predictions))

Unnamed: 0,0,1
0,265,42
1,58,133


In [25]:
# Create a string object
classification_report(y_train, y_predictions)

'              precision    recall  f1-score   support\n\n           0       0.82      0.86      0.84       307\n           1       0.76      0.70      0.73       191\n\n    accuracy                           0.80       498\n   macro avg       0.79      0.78      0.78       498\nweighted avg       0.80      0.80      0.80       498\n'

In [26]:
# Creates a dataframe based off a dictionary

classification = classification_report(y_train, y_predictions, output_dict=True)
pd.DataFrame(classification).transpose()

Unnamed: 0,precision,recall,f1-score,support
0,0.820433,0.863192,0.84127,307.0
1,0.76,0.696335,0.726776,191.0
accuracy,0.799197,0.799197,0.799197,0.799197
macro avg,0.790217,0.779764,0.784023,498.0
weighted avg,0.797255,0.799197,0.797358,498.0


### Additional - Calculate Metrics by Hand

4. Compute: 
    - Accuracy, 
    - true positive rate, 
    - false positive rate, 
    - true negative rate, 
    - false negative rate, 
    - precision, 
    - recall, 
    - f1-score, and 
    - support.

In [27]:
# Positives - Did NOT survive (class 0); counts taken from the confusion matrix above

TP = 265
FP = 58
FN = 42
TN = 133
ALL = TP + FP + FN + TN

accuracy = (TP + TN) / ALL
print(f"Accuracy: {accuracy:.2f}")

true_positive_rate = TP / (TP + FN)
print(f"True Positive Rate: {true_positive_rate:.2f}")

false_positive_rate = FP / (FP + TN)
print(f"False Positive Rate: {false_positive_rate:.2f}")

true_negative_rate = TN / (TN + FP)
print(f"True Negative Rate: {true_negative_rate:.2f}")

false_negative_rate = FN / (FN + TP)
print(f"False Negative Rate: {false_negative_rate:.2f}")

precision = TP / (TP + FP)
print(f"Precision: {precision:.2f}")

recall = TP / (TP + FN)
print(f"Recall: {recall:.2f}")

f1_score = 2 * (precision * recall) / (precision + recall)
print(f"F1: {f1_score:.2f}")

support_pos = TP + FN
print(f"Support (positive class): {support_pos}")

support_neg = FP + TN
print(f"Support (negative class): {support_neg}")


Accuracy: 0.80
True Positive Rate: 0.86
False Positive Rate: 0.30
True Negative Rate: 0.70
False Negative Rate: 0.14
Precision: 0.82
Recall: 0.86
F1: 0.84
Support (positive class): 307
Support (negative class): 191


In [28]:
classification = classification_report(y_train, y_predictions, output_dict=True)
pd.DataFrame(classification).transpose()

Unnamed: 0,precision,recall,f1-score,support
0,0.820433,0.863192,0.84127,307.0
1,0.76,0.696335,0.726776,191.0
accuracy,0.799197,0.799197,0.799197,0.799197
macro avg,0.790217,0.779764,0.784023,498.0
weighted avg,0.797255,0.799197,0.797358,498.0
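
The classification report above can be cross-checked by deriving each metric from the confusion matrix counts instead of typing them in by hand. The sketch below assumes the `confusion_matrix(y_true, y_pred)` layout, where rows are actual classes and columns are predicted classes. `binary_metrics` is a hypothetical helper, and the counts are the ones from the depth-1 tree on train.

```python
def binary_metrics(conf, positive=1):
    """Compute rates from a 2x2 confusion matrix.

    conf[i][j] is the count with actual class i and predicted class j.
    """
    negative = 1 - positive
    tp = conf[positive][positive]
    fn = conf[positive][negative]
    fp = conf[negative][positive]
    tn = conf[negative][negative]
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": recall,
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "support": tp + fn,
    }

# Counts from the depth-1 tree's confusion matrix on train
conf = [[265, 42], [58, 133]]
for positive in (0, 1):
    print(f"positive class = {positive}")
    for name, value in binary_metrics(conf, positive).items():
        print(f"  {name}: {value:.2f}")
```

Switching `positive` between 0 and 1 reproduces the two class rows of the report, which is a quick way to catch a mistyped count.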


5. Run through steps 2-4 using a different max_depth value.

In [29]:
for i in range(2,10):

    tree = DecisionTreeClassifier(max_depth=i, random_state=123)

    tree = tree.fit(X_train, y_train)

    y_predictions = tree.predict(X_train)
    
    classification = classification_report(y_train, y_predictions, output_dict=True)

    print(f'Tree with a max depth of {i}')
    print(pd.DataFrame(classification).transpose())
    print("___________________")


Tree with a max depth of 2
              precision    recall  f1-score     support
0              0.820433  0.863192  0.841270  307.000000
1              0.760000  0.696335  0.726776  191.000000
accuracy       0.799197  0.799197  0.799197    0.799197
macro avg      0.790217  0.779764  0.784023  498.000000
weighted avg   0.797255  0.799197  0.797358  498.000000
___________________
Tree with a max depth of 3
              precision    recall  f1-score     support
0              0.828829  0.899023  0.862500  307.000000
1              0.812121  0.701571  0.752809  191.000000
accuracy       0.823293  0.823293  0.823293    0.823293
macro avg      0.820475  0.800297  0.807654  498.000000
weighted avg   0.822421  0.823293  0.820430  498.000000
___________________
Tree with a max depth of 4
              precision    recall  f1-score     support
0              0.829341  0.902280  0.864275  307.000000
1              0.817073  0.701571  0.754930  191.000000
accuracy       0.825301  0.825301  0.82

6. Which model performs better on your in-sample data?

The model with a max depth of 9 performs best in-sample, with an accuracy of 91%. Training accuracy rises as depth increases, which also shows how a deeper tree can overfit the training data.

### Validation
7. Which model performs best on your out-of-sample data, the validate set?

In [30]:
metrics = []

for i in range (2, 20):
        # Make the model
        tree = DecisionTreeClassifier(max_depth=i, random_state=123)
        
        # Fit the model (on TRAIN only)
        tree = tree.fit(X_train, y_train)
        
        #Use the model - on train first, then on validate
        in_sample_accuracy = tree.score(X_train, y_train)
        
        out_of_sample_accuracy = tree.score(X_validate, y_validate)
        
        output = {
            "max_depth": i,
            "train_accuracy": in_sample_accuracy,
            "validate_accuracy": out_of_sample_accuracy
        }
        
        metrics.append(output)
        
df = pd.DataFrame(metrics)
df['difference'] = df.train_accuracy - df.validate_accuracy

df
    

Unnamed: 0,max_depth,train_accuracy,validate_accuracy,difference
0,2,0.799197,0.761682,0.037515
1,3,0.823293,0.785047,0.038246
2,4,0.825301,0.785047,0.040254
3,5,0.837349,0.757009,0.08034
4,6,0.859438,0.766355,0.093083
5,7,0.863454,0.761682,0.101772
6,8,0.89759,0.757009,0.140581
7,9,0.909639,0.761682,0.147956
8,10,0.923695,0.766355,0.15734
9,11,0.931727,0.761682,0.170045


In [31]:
df[df.difference <= 0.10].sort_values(by=['validate_accuracy', 'difference'])

Unnamed: 0,max_depth,train_accuracy,validate_accuracy,difference
3,5,0.837349,0.757009,0.08034
0,2,0.799197,0.761682,0.037515
4,6,0.859438,0.766355,0.093083
1,3,0.823293,0.785047,0.038246
2,4,0.825301,0.785047,0.040254
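
The filter-and-sort step above can be turned into an explicit selection rule: keep models whose train/validate gap is at most 0.10, then take the highest validate accuracy, breaking ties with the smaller gap. The sketch below applies that rule to rows copied from the table above. The tie-break is an assumption about what "best" should mean here, not something the exercise prescribes.

```python
# Rows copied from the train/validate table above (max_depth 2 through 7)
metrics = [
    {"max_depth": 2, "train_accuracy": 0.799197, "validate_accuracy": 0.761682},
    {"max_depth": 3, "train_accuracy": 0.823293, "validate_accuracy": 0.785047},
    {"max_depth": 4, "train_accuracy": 0.825301, "validate_accuracy": 0.785047},
    {"max_depth": 5, "train_accuracy": 0.837349, "validate_accuracy": 0.757009},
    {"max_depth": 6, "train_accuracy": 0.859438, "validate_accuracy": 0.766355},
    {"max_depth": 7, "train_accuracy": 0.863454, "validate_accuracy": 0.761682},
]
max_gap = 0.10

# Keep only models whose train/validate gap is within the tolerance
candidates = []
for row in metrics:
    gap = row["train_accuracy"] - row["validate_accuracy"]
    if gap <= max_gap:
        candidates.append({**row, "difference": gap})

# Highest validate accuracy wins; a smaller gap breaks ties
best = max(candidates, key=lambda r: (r["validate_accuracy"], -r["difference"]))
print(best["max_depth"])
```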


# RANDOM FORESTS

### Fit - Transform

1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.

In [36]:
# Make the model
forest1 = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=123)

# Fit the model (on train and only train)
forest1.fit(X_train, y_train)

# Use the model
# We'll evaluate the model's performance on train, first
y_predictions = forest1.predict(X_train)

# Produce the classification report on the actual y values and this model's predicted y values
report = classification_report(y_train, y_predictions, output_dict=True)
print("Random forest with max depth of 10")
pd.DataFrame(report)

Random forest with max depth of 10


Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.931889,0.965714,0.943775,0.948801,0.944862
recall,0.980456,0.884817,0.943775,0.932636,0.943775
f1-score,0.955556,0.923497,0.943775,0.939526,0.94326
support,307.0,191.0,0.943775,498.0,498.0


### Evaluate Performance

2. Evaluate your results using the model score, confusion matrix, and classification report.

- See Below

3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [38]:
# sklearn confusion matrix: pass actual values first so rows are actual classes and columns are predicted classes
pd.DataFrame(confusion_matrix(y_train, y_predictions))

Unnamed: 0,0,1
0,301,6
1,22,169


In [40]:
y_pred1 = forest1.predict(X_train)
forest_score = forest1.score(X_train, y_train)
conf = confusion_matrix(y_train, y_pred1)
tpr = conf[1][1] / conf[1].sum()
fpr = conf[0][1] / conf[0].sum()
tnr = conf[0][0] / conf[0].sum()
fnr = conf[1][0] / conf[1].sum()
print(f'''
    The accuracy for our model is {forest_score:.4}
    The True Positive Rate is {tpr:.3}, The False Positive Rate is {fpr:.3},
    The True Negative Rate is {tnr:.3}, and the False Negative Rate is {fnr:.3}
    ''')
pd.DataFrame(classification_report(y_train, y_pred1, output_dict=True))


    The accuracy for our model is 0.9438
    The True Positive Rate is 0.885, The False Positive Rate is 0.0195,
    The True Negative Rate is 0.98, and the False Negative Rate is 0.115
    


Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.931889,0.965714,0.943775,0.948801,0.944862
recall,0.980456,0.884817,0.943775,0.932636,0.943775
f1-score,0.955556,0.923497,0.943775,0.939526,0.94326
support,307.0,191.0,0.943775,498.0,498.0


### Review Metrics

In [None]:
# Positives - Did NOT survive (class 0); counts taken from the random forest confusion matrix above
TP = 301
FP = 22
FN = 6
TN = 169
ALL = TP + FP + FN + TN

accuracy = (TP + TN) / ALL
print(f"Accuracy: {accuracy:.2f}")
      
true_positive_rate = TP / (TP + FN)
print(f"True Positive Rate: {true_positive_rate:.2f}")
      
false_positive_rate = FP /(FP + TN)
print(f"False Positive Rate: {false_positive_rate:.2f}")
      
true_negative_rate = TN / (TN + FP)
print(f"True Negative Rate: {true_negative_rate:.2f}")
      
false_negative_rate = FN / (FN + TP)
print(f"False Negative Rate: {false_negative_rate:.2f}")

precision = TP / (TP + FP)
print(f"Precision: {precision:.2f}")
            
recall = TP / ( TP + FN)
print(f"Recall: {recall:.2f}")

f1_score = 2 *(precision*recall) / (precision+recall)
print(f"F1: {f1_score:.2f}")
       
support_pos = TP + FN
print(f"Support (positive class): {support_pos}")

support_neg = FP + TN
print(f"Support (negative class): {support_neg}")
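
Hand-entered counts are easy to mistype, so it can help to unpack them directly from the confusion matrix. The sketch below uses the `ravel()` idiom on a 2x2 matrix from scikit-learn. The labels are toy values for illustration only, and class 1 is treated as the positive class, unlike the cell above.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not Titanic data)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# With labels [0, 1], ravel() yields the counts in the order tn, fp, fn, tp,
# treating class 1 as the positive class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

true_positive_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
print(tn, fp, fn, tp)
print(f"TPR: {true_positive_rate:.2f}, FPR: {false_positive_rate:.2f}")
```

In the notebook, `y_true` and `y_pred` would be `y_train` and the forest's predictions on `X_train`.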

4. Run through steps increasing your min_samples_leaf and decreasing your max_depth.

In [44]:
#From Lesson Review Example
for i in range(2, 11):
    # Make the model
    forest = RandomForestClassifier(max_depth=i, random_state=123)

    # Fit the model (on train and only train)
    forest = forest.fit(X_train, y_train)

    # Use the model
    # We'll evaluate the model's performance on train, first
    y_predictions = forest.predict(X_train)

    # Produce the classification report on the actual y values and this model's predicted y values
    report = classification_report(y_train, y_predictions, output_dict=True)
    print(f"Tree with max depth of {i}")
    print(pd.DataFrame(report))
    print()

Tree with max depth of 2
                    0           1  accuracy   macro avg  weighted avg
precision    0.774799    0.856000  0.795181    0.815399      0.805942
recall       0.941368    0.560209  0.795181    0.750789      0.795181
f1-score     0.850000    0.677215  0.795181    0.763608      0.783731
support    307.000000  191.000000  0.795181  498.000000    498.000000

Tree with max depth of 3
                    0           1  accuracy   macro avg  weighted avg
precision    0.813754    0.845638  0.823293    0.829696      0.825982
recall       0.925081    0.659686  0.823293    0.792384      0.823293
f1-score     0.865854    0.741176  0.823293    0.803515      0.818036
support    307.000000  191.000000  0.823293  498.000000    498.000000

Tree with max depth of 4
                    0           1  accuracy   macro avg  weighted avg
precision    0.816384    0.875000  0.833333    0.845692      0.838865
recall       0.941368    0.659686  0.833333    0.800527      0.833333
f1-score     

5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

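- The model with a max depth of 10 produces the best in-sample metrics, with 94.4% accuracy on train, about 32 percentage points above the baseline. Deeper trees fit the training data more closely, which raises in-sample accuracy but does not by itself show that the model will generalize.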
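
Exercise 4 asks for `min_samples_leaf` to increase while `max_depth` decreases, whereas the loop above varies only `max_depth`. One way to pair the two is sketched below. It runs on synthetic data from `make_classification` so that it is self-contained; the data, the starting depth of 10, and the one-for-one pairing are all assumptions, and in the notebook the Titanic `X_train`, `y_train`, `X_validate` and `y_validate` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in the Titanic splits)
X, y = make_classification(n_samples=500, n_features=8, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

results = []
max_depth_start = 10
for min_samples_leaf in range(1, 10):
    # Each time the leaf size grows by one, the depth shrinks by one
    max_depth = max_depth_start - (min_samples_leaf - 1)
    forest = RandomForestClassifier(
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=123,
    )
    forest.fit(X_tr, y_tr)
    results.append({
        "min_samples_leaf": min_samples_leaf,
        "max_depth": max_depth,
        "train_accuracy": forest.score(X_tr, y_tr),
        "validate_accuracy": forest.score(X_val, y_val),
    })

for row in results:
    print(row)
```

Both changes constrain the trees, so train accuracy would usually be expected to fall as the loop progresses; whether validate accuracy improves depends on the data.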

After making a few models, which one has the best performance (or closest metrics) on both train and validate?

In [43]:
# from Lesson Review
metrics = []

for i in range(2, 25):
    # Make the model
    forest = RandomForestClassifier(max_depth=i, random_state=123)

    # Fit the model (on train and only train)
    forest = forest.fit(X_train, y_train)

    # Use the model
    # We'll evaluate the model's performance on train, first
    in_sample_accuracy = forest.score(X_train, y_train)
    
    out_of_sample_accuracy = forest.score(X_validate, y_validate)

    output = {
        "max_depth": i,
        "train_accuracy": in_sample_accuracy,
        "validate_accuracy": out_of_sample_accuracy
    }
    
    metrics.append(output)
    
df = pd.DataFrame(metrics)
df["difference"] = df.train_accuracy - df.validate_accuracy
df

Unnamed: 0,max_depth,train_accuracy,validate_accuracy,difference
0,2,0.795181,0.771028,0.024153
1,3,0.823293,0.775701,0.047592
2,4,0.833333,0.794393,0.038941
3,5,0.845382,0.808411,0.03697
4,6,0.895582,0.808411,0.087171
5,7,0.907631,0.794393,0.113238
6,8,0.921687,0.803738,0.117948
7,9,0.941767,0.78972,0.152047
8,10,0.943775,0.785047,0.158728
9,11,0.945783,0.780374,0.165409


- The models with max depths of 5 and 6 tie for the best out-of-sample accuracy, 80.8% on validate, roughly 19 percentage points above the 62% baseline. Depth 6 reaches about 90% accuracy on train, but depth 5 has the smaller gap between train and validate (0.037 versus 0.087), so it is the less overfit of the two.

# K-NEAREST NEIGHBORS

Continue working in your model file with the titanic dataset.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

## STEP 2: Acquire

In [None]:
df = acquire.get_titanic_data()
df.head()

## STEP 3: Prepare - Clean Data

In [None]:
df = prepare.prep_titanic(df)
df.head()

In [None]:
# drop unnecessary data
df = df.drop(columns=["sex", "embark_town"])
df.head()

#### Prepare - Split Data

In [None]:
# call train, validate, test
train, validate, test = train_validate_test_split(df, 'survived', seed=123)

In [None]:
# Verify shape to make sure they are appropriate splits.
train.shape, validate.shape, test.shape

In [None]:
# create X & y version of train, where y is a series with just the target variable and X are all the features. 

X_train = train.drop(columns=['survived'])
y_train = train.survived

X_validate = validate.drop(columns=['survived'])
y_validate = validate.survived

X_test = test.drop(columns=['survived'])
y_test = test.survived

In [None]:
train.head()

## STEP 4: EXPLORATION / PRE-PROCESSING
- Done previously

## STEP 5: MODELING

1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [None]:
# Create KNN Object
# weights options: 'uniform' (default) or 'distance'
knn1 = KNeighborsClassifier()

In [None]:
# Fit model
knn1.fit(X_train, y_train)

In [None]:
# Make predictions on the same features the model was fit on
y_pred = knn1.predict(X_train)

2. Evaluate your results using the model score, confusion matrix, and classification report.

In [None]:
print('Accuracy of KNN classifier on training set: {:.2f}'
     .format(knn1.score(X_train, y_train)))

In [None]:
print(classification_report(y_train, y_pred))

In [None]:
report = classification_report(y_train, y_pred, output_dict=True)
print('n_neighbors = 5 (default)')
pd.DataFrame(report)

In [None]:
y_pred_proba = knn1.predict_proba(X_train)

3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [None]:
print('Accuracy of KNN classifier on validate set: {:.2f}'
     .format(knn1.score(X_validate, y_validate)))

4. Run through steps 2-4 setting k to 10

5. Run through steps 2-4 setting k to 20

6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

7. Which model performs best on our out-of-sample data from validate?
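
The remaining KNN steps can be handled in one loop over k that records train and validate accuracy side by side. The sketch below again uses synthetic data from `make_classification` so that it runs on its own; the data and the variable names are assumptions, and the Titanic splits would replace them in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; swap in the Titanic splits)
X, y = make_classification(n_samples=500, n_features=8, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

knn_metrics = []
for k in (5, 10, 20):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)  # fit on train only
    train_accuracy = knn.score(X_tr, y_tr)
    validate_accuracy = knn.score(X_val, y_val)
    knn_metrics.append({
        "k": k,
        "train_accuracy": train_accuracy,
        "validate_accuracy": validate_accuracy,
        "difference": train_accuracy - validate_accuracy,
    })

for row in knn_metrics:
    print(row)
```

A smaller k lets the model follow the training data more closely, so comparing the `difference` column across k values is one way to answer questions 6 and 7.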

# LOGISTIC REGRESSION MODEL

In these exercises, we'll continue working with the titanic dataset and building logistic regression models. 
- Throughout this exercise, be sure you are training, evaluating, and comparing models on the train and validate datasets.
- The test dataset should only be used for your final model.

- For all of the models you create, choose a threshold that optimizes for accuracy.
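
Choosing a threshold that optimizes accuracy means sweeping candidate cutoffs over the predicted probabilities and keeping the one that classifies the most rows correctly. The sketch below shows the idea with made-up probabilities and labels. `accuracy_at_threshold` is a hypothetical helper; with a fitted scikit-learn classifier the probabilities would come from `predict_proba(...)[:, 1]`.

```python
def accuracy_at_threshold(probabilities, actuals, threshold):
    """Share of rows classified correctly when predicting 1 for probability >= threshold."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(pred == actual for pred, actual in zip(predictions, actuals))
    return correct / len(actuals)

# Made-up probabilities of class 1 and actual labels, for illustration only
probabilities = [0.10, 0.35, 0.40, 0.45, 0.55, 0.62, 0.80, 0.90]
actuals = [0, 0, 0, 1, 0, 1, 1, 1]

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05
thresholds = [t / 100 for t in range(5, 100, 5)]
best_threshold = max(
    thresholds,
    key=lambda t: accuracy_at_threshold(probabilities, actuals, t),
)
best_accuracy = accuracy_at_threshold(probabilities, actuals, best_threshold)
print(f"Best threshold: {best_threshold:.2f}, accuracy: {best_accuracy:.3f}")
```

The sweep should be run on train (or validate) only, so that the test set stays untouched until the final model is chosen. When several thresholds tie, `max` keeps the first one it encounters.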

## STEP 2: Acquire

In [None]:
df = acquire.get_titanic_data()
df.head()

## STEP 3: Prepare - Clean the data

#### Prepare: Clean Null Values ###

In [None]:
### CHECK FOR NULL VALUES ### 
df.isna().sum()

In [None]:
# we will be using - age -> 177 null values
# What to do with deck?

# Use the average age to fill in the null values within the age column.
avg_age = df.age.mean()
df.age = df.age.fillna(avg_age)   

#### Prepare: Encode Data

In [None]:
# Encode string (categorical) values into numeric values so the model can use them.

df["is_female"] = (df.sex == 'female').astype('int')

In [None]:
# create dummy variables to encode embark_town into numeric values
dummy_df = pd.get_dummies(df[['embark_town']], dummy_na=False, drop_first=True)

# reassign df with added columns for embark_town
df = pd.concat([df, dummy_df], axis=1)

In [None]:
# drop unnecessary columns, like columns we used to encode data

df = df.drop(columns=["passenger_id", "deck", "class", "embarked", "sex", "embark_town"])

In [None]:
df.head()

#### Prepare - Split Data

In [None]:
# Split the datasets
train, validate, test = train_validate_test_split(df, 'survived', seed=123)

In [None]:
train.shape, validate.shape, test.shape

#### Prepare - Assign X and y values for splits

In [None]:
# Separate out our X and y values from each dataset
X_train = train.drop(columns=["survived"])
y_train = train.survived

X_validate = validate.drop(columns=["survived"])
y_validate = validate.survived

X_test = test.drop(columns=["survived"])
y_test = test.survived

In [None]:
X_train.head()

## STEP 4: EXPLORATION / PRE-PROCESSING
- Done previously

## STEP 5: MODELING

In [None]:
# identify the mode of the target (y) variable
train.survived.value_counts()

In [None]:
# Use mode to set baseline

# Obtain the mode for the target (mode() returns a Series, so take its first value)
baseline = y_train.mode()[0]

# Boolean Series: True where the actual value matches the baseline prediction
matches_baseline_prediction = (y_train == baseline)

baseline_accuracy = matches_baseline_prediction.mean()

print(f'Baseline Accuracy: {round(baseline_accuracy, 2)}')

### EXERCISE 1
1. Create a model that includes age in addition to fare and pclass. 

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
#Use logistic regression

logit = LogisticRegression(random_state=123)

In [None]:
# Set the features we will use (listed in the problem)

features = ['age', 'pclass', 'fare']

In [None]:
# Fit the model using only the features desired.

logit.fit(X_train[features], y_train)

In [None]:
# Predict on the same set of features we fit the model to. 

y_pred = logit.predict(X_train[features])

In [None]:
# Revisit baseline and compare to logistic regression classifier.

print("Baseline is", round(baseline_accuracy, 2))
print("Logistic Regression using age, pclass, and fare features")
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train[features], y_train)))

- Does this model perform better than your baseline?

-- The model performs with an accuracy of .70, which is .08 better than the baseline of .62

### EXERCISE 2
2. Include sex in your model as well. 
- Note that you'll need to encode or create a dummy variable of this feature before including it in a model (previously completed -> is_female)

In [None]:
#Use logistic regression
logit = LogisticRegression(random_state=123)

# Set the features we will use (listed in the problem)
features = ['age', 'pclass', 'fare', 'is_female']

# Fit the model using only the features desired.
logit.fit(X_train[features], y_train)

# Predict on the same set of features we fit the model to. 
y_pred = logit.predict(X_train[features])

# Revisit baseline and compare to logistic regression classifier.

print("Baseline is", round(baseline_accuracy, 2))
print("Logistic Regression using age, pclass, fare, and is_female features")
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train[features], y_train)))

- Does this model perform better than your baseline?

-- The model performs with an accuracy of .81, which is .19 better than the baseline of .62

### EXERCISE 3

3. Try out other combinations of features and models.

### EXERCISE 4

4. Use your best 3 models to predict and evaluate on your validate sample.

### EXERCISE 5

5. Choose your best model from the validation performance, and evaluate it on the test dataset. 
- How do the performance metrics compare to validate? 
- How do the performance metrics compare to train?