# Machine Learning Project

>The goal of this project is to create a machine learning model that analyzes the behavior of Megaline's clientele and recommends one of the company's new plans: Ultra or Smart. <br> <br> In the Statistical Data Analysis project, we completed <b>Step 1: Getting the Data</b> and <b>Step 2: Clean, Prepare and Manipulate Data.</b> Now it is time to do <b>Step 3: Training the Model.</b>

In [2]:
import pandas as pd
import sys
import warnings
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

## 1. Viewing Data

In [3]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.head(10)

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
5   58.0   344.56      21.0  15823.37         0
6   57.0   431.64      20.0   3738.90         1
7   15.0   132.40       6.0  21911.60         0
8    7.0    43.39       3.0   2538.67         1
9   90.0   665.41      38.0  17358.61         0


## 2. Splitting into train/validation/test sets

In [4]:
features = ['calls', 'minutes', 'messages', 'mb_used'] # feature columns
target = ['is_ultra'] # target column

df_train, df_temp = train_test_split(df, test_size=0.4) # 60% train, 40% held out
df_test, df_valid = train_test_split(df_temp, test_size=0.5) # split the held-out 40% into 20% test, 20% validation

features_train = df_train.drop(['is_ultra'], axis=1) # drop the target so it is not used as a feature
target_train = df_train['is_ultra'] # target

features_valid = df_valid.drop(['is_ultra'], axis=1) # drop the target so it is not used as a feature
target_valid = df_valid['is_ultra'] # target

features_test = df_test.drop(['is_ultra'], axis=1) # drop the target so it is not used as a feature
target_test = df_test['is_ultra'] # target

print("trained features:")
print(features_train.head(5))
print('')
print("trained targets:")
print(target_train.head(5))

trained features:
      calls  minutes  messages   mb_used
925    61.0   399.98      24.0  17553.71
934    70.0   450.03      51.0  17567.13
2081   76.0   535.01      20.0  12272.83
2867   70.0   443.01       9.0  13834.84
1171   49.0   344.92      17.0  23383.40

trained targets:
925     0
934     0
2081    0
2867    0
1171    0
Name: is_ultra, dtype: int64


### Conclusion

>Since we want to recommend a new plan to Megaline's clients, our target is ```is_ultra```, and the remaining columns are the features. We split the data into ```df_train``` (60%) and ```df_temp``` (40%), then split ```df_temp``` in half to create ```df_test``` and ```df_valid``` (20% each). Next, we train the models.
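
The split above runs without a fixed seed, so each run produces different subsets and slightly different accuracies. The following is a minimal sketch of a repeatable 60/20/20 split. It assumes a synthetic stand-in table (`df_demo`, with made-up values) rather than the real `users_behavior.csv`; the `random_state` value and the use of `stratify` are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (values are made up for illustration).
rng = np.random.default_rng(0)
n_rows = 1000
df_demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows),
    "mb_used": rng.uniform(0, 40000, n_rows),
    "is_ultra": rng.choice([0, 1], size=n_rows, p=[0.7, 0.3]),
})

# First cut: 60% train, 40% held out. random_state makes the split repeatable;
# stratify keeps the share of Ultra users similar in every subset.
train_demo, temp_demo = train_test_split(
    df_demo, test_size=0.4, random_state=12345, stratify=df_demo["is_ultra"]
)
# Second cut: halve the held-out part into validation and test (20% each).
valid_demo, test_demo = train_test_split(
    temp_demo, test_size=0.5, random_state=12345, stratify=temp_demo["is_ultra"]
)

print(len(train_demo), len(valid_demo), len(test_demo))
```

With a fixed seed, rerunning the notebook gives the same rows in each subset, which makes the accuracy figures in later sections comparable between runs.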

## 3. Fitting the data set and finding the best accuracy

### Tuning Hyperparameters V1

In [50]:
dtc = DecisionTreeClassifier(random_state=0, max_depth=17, criterion='gini')
dtc.fit(features_train, target_train)
dtc_predictions1 = dtc.predict(features_train)

rfc = RandomForestClassifier(max_depth=25, random_state=42, min_samples_split = 2, min_samples_leaf = 1)
rfc.fit(features_train, target_train)
rfc_predictions1 = rfc.predict(features_train)

lr = LogisticRegression(random_state=12345, solver='liblinear')
lr.fit(features_train, target_train)
lr_predictions1 = lr.predict(features_train)

print("Tuning Hyperparameters: Version 1")
print("Decision Tree Classifier Accuracy:", accuracy_score(target_train, dtc_predictions1))
print("Random Forest Classifier Accuracy:", accuracy_score(target_train, rfc_predictions1))
print("Logistic Regression Accuracy:", accuracy_score(target_train, lr_predictions1))
print("")

dtc_predictions_v = dtc.predict(features_valid)
rfc_predictions_v = rfc.predict(features_valid)
lr_predictions_v = lr.predict(features_valid)
print("Decision Tree Classifier Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, dtc_predictions_v))
print("RandomForest Classifier Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, rfc_predictions_v))
print("Logistic Regression Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, lr_predictions_v))

Tuning Hyperparameters: Version 1
Decision Tree Classifier Accuracy: 0.9610995850622407
Random Forest Classifier Accuracy: 0.9813278008298755
Logistic Regression Accuracy: 0.700207468879668

Decision Tree Classifier Validation Accuracy
Validation set: 0.7511664074650077
RandomForest Classifier Validation Accuracy
Validation set: 0.80248833592535
Logistic Regression Validation Accuracy
Validation set: 0.7247278382581649




### Tuning Hyperparameters V2

In [51]:
dtc = DecisionTreeClassifier(random_state=12345, max_depth=25, criterion='entropy')
dtc.fit(features_train, target_train)
dtc_predictions2 = dtc.predict(features_train)

rfc = RandomForestClassifier(max_depth=20, random_state=12345, min_samples_split = 3, min_samples_leaf = 1)
rfc.fit(features_train, target_train)
rfc_predictions2 = rfc.predict(features_train)

lr = LogisticRegression(random_state=42, solver='liblinear')
lr.fit(features_train, target_train)
lr_predictions2 = lr.predict(features_train)

print("Tuning Hyperparameters: Version 2")
print("Decision Tree Classifier Accuracy:", accuracy_score(target_train, dtc_predictions2))
print("Random Forest Classifier Accuracy:", accuracy_score(target_train, rfc_predictions2))
print("Logistic Regression Accuracy:", accuracy_score(target_train, lr_predictions2))
print("")
dtc_predictions_v2 = dtc.predict(features_valid)
rfc_predictions_v2 = rfc.predict(features_valid)
lr_predictions_v2 = lr.predict(features_valid)
print("Decision Tree Classifier Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, dtc_predictions_v2))
print("RandomForest Classifier Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, rfc_predictions_v2))
print("Logistic Regression Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, lr_predictions_v2))

Tuning Hyperparameters: Version 2
Decision Tree Classifier Accuracy: 0.9953319502074689
Random Forest Classifier Accuracy: 0.966804979253112
Logistic Regression Accuracy: 0.700207468879668

Decision Tree Classifier Validation Accuracy
Validation set: 0.7387247278382582
RandomForest Classifier Validation Accuracy
Validation set: 0.7978227060653188
Logistic Regression Validation Accuracy
Validation set: 0.7247278382581649




### Tuning Hyperparameters V3

In [53]:
dtc = DecisionTreeClassifier(random_state=0, max_depth=17, criterion='gini')
dtc.fit(features_train, target_train)
dtc_predictions3 = dtc.predict(features_train)

rfc = RandomForestClassifier(max_depth=10, random_state=40, min_samples_split = 3, min_samples_leaf = 1)
rfc.fit(features_train, target_train)
rfc_predictions3 = rfc.predict(features_train)

lr = LogisticRegression(random_state=12345, solver='liblinear')
lr.fit(features_train, target_train)
lr_predictions3 = lr.predict(features_train)

print("Tuning Hyperparameters: Version 3")
print("Decision Tree Classifier Accuracy:", accuracy_score(target_train, dtc_predictions3))
print("Random Forest Classifier Accuracy:", accuracy_score(target_train, rfc_predictions3))
print("Logistic Regression Accuracy:", accuracy_score(target_train, lr_predictions3))
print("")
dtc_predictions_v3 = dtc.predict(features_valid)
rfc_predictions_v3 = rfc.predict(features_valid)
lr_predictions_v3 = lr.predict(features_valid)
print("Decision Tree Classifier Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, dtc_predictions_v3))
print("RandomForest Classifier Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, rfc_predictions_v3))
print("Logistic Regression Validation Accuracy")
print("Validation set:", accuracy_score(target_valid, lr_predictions_v3))

Tuning Hyperparameters: Version 3
Decision Tree Classifier Accuracy: 0.9610995850622407
Random Forest Classifier Accuracy: 0.8827800829875518
Logistic Regression Accuracy: 0.700207468879668

Decision Tree Classifier Validation Accuracy
Validation set: 0.7511664074650077
RandomForest Classifier Validation Accuracy
Validation set: 0.8087091757387247
Logistic Regression Validation Accuracy
Validation set: 0.7247278382581649




### Conclusion

> The RandomForest classifier from V3 gives the highest validation accuracy (about 0.81), ahead of V1 (0.80) and V2 (0.80). This isn't surprising, as RandomForest is known for its high accuracy. The Decision Tree classifier scored higher on the training set in V2 and V3, but it performed worse on the validation set, which points to overfitting. Let's move on and test our model.
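
The three versions above change hyperparameters by hand and reassign `rfc` each time, so the object evaluated later is whichever fit ran last. As a hedged sketch of the same idea, the loop below tries several `max_depth` values and keeps a reference to the model with the best validation accuracy. It assumes synthetic data from `make_classification` in place of the real features, and the depth values and `n_estimators=50` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the four usage features and is_ultra.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_depth, best_score, best_model = None, -1.0, None
for depth in (5, 10, 20, 25):  # depths chosen for illustration
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    print(f"max_depth={depth}: train={train_acc:.3f}, valid={valid_acc:.3f}")
    # Select on validation accuracy, not training accuracy.
    if valid_acc > best_score:
        best_depth, best_score, best_model = depth, valid_acc, model

print("selected max_depth:", best_depth)
```

Keeping `best_model` in its own variable avoids any ambiguity about which fitted model is later scored on the test set.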

## 4. Checking the quality of the model

In [58]:
rfc_predictionsv2 = rfc.predict(features_test)
print("RandomForest Classifier Accuracy")
print("Test set:", accuracy_score(target_test, rfc_predictionsv2))
print("")
print(rfc_predictionsv2)

RandomForest Classifier Accuracy
Test set: 0.7838258164852255

[1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0
 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 1 0
 0 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1
 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 0 0 0 0 1
 1 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 1 0 1 0 0 0 0 1 1 0
 1 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 0 0
 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 

### Conclusion

> After testing our model, we see that it reaches about 78% accuracy on the test set!
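
For reference, accuracy is the fraction of predictions that match the true labels. The toy example below works this out by hand on made-up arrays (`y_true` and `y_pred` are hypothetical, not taken from the notebook) to show the quantity that `accuracy_score` reports.

```python
import numpy as np

# Toy labels and predictions (made up for illustration).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1, 0, 0, 0])

# Accuracy = number of matching positions / total number of predictions.
correct = int((y_true == y_pred).sum())
accuracy = correct / len(y_true)
print(correct, accuracy)
```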

## Sanity Check

> Does our model do better than a baseline that always predicts the most frequent class?

In [59]:
from sklearn.dummy import DummyClassifier

# Baseline that always predicts the most frequent class
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(features_test, target_test)
dummy_clf.predict(features_test)
dummy_clf.score(features_test, target_test)

# Count how many test predictions fall into each class
zero = 0
one = 0
for i in rfc_predictionsv2:
    if i == 0:
        zero += 1
    if i == 1:
        one += 1

print("zero: ", zero)
print("one: ", one)
print(zero / len(rfc_predictionsv2))  # share of predictions that are class 0

zero:  490
one:  153
0.7620528771384136


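> About 76% of the model's test predictions are class 0. This figure is the share of predicted zeros rather than an accuracy score; the accuracy to beat is the one returned by ```dummy_clf.score```, which should sit below the model's 78% test accuracy for the model to be useful.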
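
A sanity check of this kind usually compares the model's test accuracy with the accuracy of a trivial baseline. The sketch below shows one way to do that, assuming synthetic imbalanced data in place of the real dataset; the names `baseline` and `model`, the 70/30 class weights, and the hyperparameters are illustrative assumptions. The baseline is fit on the training data so that it does not see the test labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the real features and is_ultra.
X, y = make_classification(n_samples=800, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Baseline: always predict the most frequent class seen in the training data.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=50, max_depth=10,
                               random_state=1).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")

# Class counts among the model's predictions (index 0 = class 0, index 1 = class 1).
counts = np.bincount(model.predict(X_test), minlength=2)
print("predicted zeros / ones:", counts[0], counts[1])
```

If the model's accuracy is not clearly above the baseline's, the model has learned little beyond the class imbalance.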

## 5. Overall Conclusion

> We chose the RandomForest classifier because it gave the best validation accuracy of the three models (and up to 98% accuracy on the training set), then evaluated it on the test set, which is what really matters. The test set gave 78% accuracy, which satisfies the threshold. We then sanity-checked the model by fitting a most-frequent-class baseline and counting the model's predictions per class: about 76% of its test predictions are class 0.