Hello, my name is Artem. I'm going to review your project!

You can find my comments in <font color='green'>green</font>, <font color='blue'>blue</font> or <font color='red'>red</font> boxes like this:

<div class="alert alert-block alert-success">
<b>Success:</b> if everything is done successfully
</div>

<div class="alert alert-block alert-info">
<b>Improve: </b> "Improve" comments mean that there are tiny corrections that could help you to make your project better.
</div>

<div class="alert alert-block alert-danger">
<b>Needs fixing:</b> if the block requires some corrections. Work can't be accepted with the red comments.
</div>

### <font color='orange'>General feedback</font>
* I'm glad to say that you executed your project really well.
* Your project has passed code-review. Congratulations!
* You've achieved a good score! Well done!
* Parameter tuning was done correctly!
* You can make your project even better by working on the "improve" comments.
* You're on the right track. Keep it up!

<b>Machine Learning Model Prediction for Megaline Phone Company</b>

<p>Company: Megaline</p>

<p>Target: To develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.</p>

<p>Threshold of accuracy: 0.75 (75%)</p>

<b>Table of contents</b>
1. [Step 1. Open the data files and study the general information](#step1)
2. [Step 2. Split the source data into training, validation, and test sets](#step2)
3. [Step 3. Investigate the quality of different models by changing hyperparameters](#step3)
4. [Step 4. Check the quality of the model using the test set](#step4)
5. [Step 5. Sanity check the model](#step5)
6. [Overall Conclusion](#conclusion)

<b>Step 1. Open the data file and study the general information. <a name="step1"></a></b>

In [140]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#import models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

<div class="alert alert-block alert-success">
<b>Success:</b> Thank you for collecting all imports in the first cell!
</div>

In [90]:
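try:
    user_behavior = pd.read_csv("/datasets/users_behavior.csv")
except FileNotFoundError:
    # Stop with a clear message; the original called sys.exit() without importing sys
    raise SystemExit("Could not read file: /datasets/users_behavior.csv")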

def get_information(df):
    print('Head:')
    print()
    display(df.head())
    print('-'*100)
    print('Info:')
    print()
    display(df.info())
    print('-'*100)
    print('Describe:')
    print()
    display(df.describe())
    print()
    print('Columns with nulls:')
    display(get_percent_of_na_df(df,4))
    print('-'*100)
    print('Shape:')
    print(df.shape)
    print('-'*100)
    print('Duplicate:')
    print('We have {} duplicated rows.'.format(df.duplicated().sum()))
    
def get_percent_of_na_df(df,num):
    df_nulls = pd.DataFrame(df.isna().sum(),columns=['Missing Values'])
    df_nulls['Percent of Nulls'] = round(df_nulls['Missing Values'] / df.shape[0],num) *100
    return df_nulls

def get_percent_of_na(df, num=2):
    # Print the share and count of missing values for every column that has any
    count = 0
    s = df.isna().sum() / df.shape[0]
    for column, percent in zip(s.index, s.values):
        num_of_nulls = df[column].isna().sum()
        if num_of_nulls == 0:
            continue
        count += 1
        # num sets the number of decimal places in the printed percentage
        print('Column {} has {:.{}%} of nulls, and {} nulls'.format(column, percent, num, num_of_nulls))

    if count != 0:
        print('There are {} columns with NA'.format(count))
    else:
        print()
        print('There are no columns with NA')
        
get_information(user_behavior)

Head:



| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


None

----------------------------------------------------------------------------------------------------
Describe:



| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |



Columns with nulls:


| | Missing Values | Percent of Nulls |
|---|---|---|
| calls | 0 | 0.0 |
| minutes | 0 | 0.0 |
| messages | 0 | 0.0 |
| mb_used | 0 | 0.0 |
| is_ultra | 0 | 0.0 |


----------------------------------------------------------------------------------------------------
Shape:
(3214, 5)
----------------------------------------------------------------------------------------------------
Duplicate:
We have 0 duplicated rows.


<b>Conclusion</b>

<p>There are no missing values or duplicated rows in the data.</p>
<p>The features of the data set are calls, minutes, messages, and mb_used. The target that we are trying to predict is the 'is_ultra' column, where the Ultra plan is coded as 1 and the Smart plan as 0. Because the target takes only two values, this is a binary classification task. For our model we can use logistic regression, decision trees, or random forests.</p>

<div class="alert alert-block alert-success">
<b>Success:</b> Data loading and initial analysis were done well!
</div>

In [91]:
#Find the number of each plan
user_behavior.is_ultra.value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

<p>There are about 2.3 times as many Smart plans (2229) as Ultra plans (985), so the classes are imbalanced.</p>
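<p>To make the imbalance concrete, here is a minimal sketch that turns the counts printed above into class shares. The counts are copied from the <code>value_counts()</code> output; in the notebook itself, <code>user_behavior.is_ultra.value_counts(normalize=True)</code> should give the same shares directly.</p>

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0: 2229, 1: 985}, name="is_ultra")

# Share of each plan in the data set
shares = counts / counts.sum()
# Ratio of Smart (0) to Ultra (1) subscribers
ratio = counts[0] / counts[1]

print(shares.round(3))
print("Smart-to-Ultra ratio:", round(ratio, 2))
```

<p>The Smart share of roughly 69% is also the accuracy that a model would get by always predicting Smart, which is a useful reference point for the 0.75 threshold.</p>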

<div class="alert alert-block alert-success">
<b>Success:</b> Glad to see that you've noticed that classes are imbalanced.
</div>

In [92]:
#Assign the features and target of the model
features = user_behavior.drop(['is_ultra'], axis=1)
target = user_behavior[['is_ultra']]

<b>Step 2. Split the source data into training, validation, and test sets<a name="step2"></a></b>

In [93]:
#Train: 60% of user_behavior data set
#Validation: 20% of user_behavior data set
#Test: 20% of user_behavior data set

features1, features_test, target1, target_test = train_test_split(features, target, test_size=0.2, random_state=42)
features_train, features_val, target_train, target_val = train_test_split(features1,target1, test_size=0.25, random_state=42)

In [94]:
print("Train set:", np.array([features_train.shape[0]])/user_behavior.shape[0])

Train set: [0.59987554]


In [95]:
print("Validation set:", np.array([features_val.shape[0]])/user_behavior.shape[0])

Validation set: [0.20006223]


In [96]:
print("Test set:", np.array([features_test.shape[0]])/user_behavior.shape[0])

Test set: [0.20006223]


<b>Conclusion</b>
<p>The train, validation, and test sets make up about 60%, 20%, and 20% of the source data set, as intended.</p>
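<p>Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the Ultra share roughly equal across the three sets. The sketch below is not what the notebook does: it uses a synthetic stand-in data frame (the <code>demo</code> names and random values are assumptions) with the same number of rows and the same class counts, and adds the <code>stratify</code> argument of <code>train_test_split</code>.</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): same row count and class counts
rng = np.random.default_rng(0)
n_rows = 3214
demo = pd.DataFrame({
    "calls": rng.integers(0, 245, n_rows),
    "mb_used": rng.uniform(0, 50000, n_rows),
    "is_ultra": (np.arange(n_rows) < 985).astype(int),
})
demo_features = demo.drop(columns="is_ultra")
demo_target = demo["is_ultra"]

# 20% test, then 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2)
f_rest, f_test, t_rest, t_test = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=42, stratify=demo_target)
f_train, f_val, t_train, t_val = train_test_split(
    f_rest, t_rest, test_size=0.25, random_state=42, stratify=t_rest)

# Print the size and the Ultra share of each part
for name, part in [("train", t_train), ("validation", t_val), ("test", t_test)]:
    print(name, len(part), round(part.mean(), 3))
```

<p>With stratification, the Ultra share in each part should stay close to the overall 30.6%, so validation and test accuracy are measured on comparable class mixes.</p>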

<div class="alert alert-block alert-success">
<b>Success:</b> Data was split correctly. Glad to see that you've checked yourself.
</div>

<b>Step 3. Investigate the quality of different models by changing hyperparameters<a name="step3"></a></b>



<b>Model: DecisionTreeClassifier</b>

In [97]:
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)
accuracy = model.score(features_val, target_val)*100
print('Accuracy:', accuracy.round(), '%')

Accuracy: 72.0 %


<p>The validation accuracy of 72% is below the required threshold of 0.75. To improve the model's performance, we tune its hyperparameters.</p>

In [98]:
#Tune Max Depth to get a better performance on the DecisionTreeClassifier model
for i in range(1,21):
    model = DecisionTreeClassifier(max_depth=i, random_state=12345)
    model.fit(features_train, target_train)
    print('Max Depth ' + str(i)+': ' + str((model.score(features_val, target_val)*100).round(1))+'%')


Max Depth 1: 74.2%
Max Depth 2: 77.4%
Max Depth 3: 77.4%
Max Depth 4: 78.1%
Max Depth 5: 77.1%
Max Depth 6: 78.1%
Max Depth 7: 78.8%
Max Depth 8: 77.4%
Max Depth 9: 77.8%
Max Depth 10: 77.0%
Max Depth 11: 77.0%
Max Depth 12: 76.5%
Max Depth 13: 75.9%
Max Depth 14: 76.4%
Max Depth 15: 76.4%
Max Depth 16: 76.0%
Max Depth 17: 74.2%
Max Depth 18: 72.9%
Max Depth 19: 73.7%
Max Depth 20: 73.1%


In [152]:
#Tune criterion to get a better performance on the DecisionTreeClassifier model
model = DecisionTreeClassifier(random_state=12345, criterion="entropy")
model.fit(features_train, target_train.values.ravel())
entropy = model.score(features_val, target_val)*100
print('Entropy ', entropy.round(1), '%')

Entropy  70.8 %


<b>Conclusion</b>
<p>The model reaches its highest validation accuracy, 78.8%, when the max depth of the decision tree is 7.</p>

<p>Changing the criterion from gini to entropy decreases the original accuracy of the model from 72% to 71%. Therefore, it is best to use the default value of criterion.</p>
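<p>Instead of reading the best depth off the printed list, the loop can keep track of it. The following is a minimal sketch on synthetic data from <code>make_classification</code> (an assumption, used only so the example runs on its own); in the notebook, <code>features_train</code>, <code>target_train</code>, <code>features_val</code>, and <code>target_val</code> would take the place of the <code>X_*</code> and <code>y_*</code> arrays.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), four features like the real data set
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

best_depth, best_score = None, 0.0
for depth in range(1, 21):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    # Strict '>' keeps the shallowest tree when several depths tie
    if score > best_score:
        best_depth, best_score = depth, score

print(f"Best max_depth: {best_depth}, validation accuracy: {best_score:.3f}")
```

<p>Keeping the shallowest depth on ties is a design choice: a shallower tree is simpler and less likely to overfit for the same validation score.</p>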

<div class="alert alert-block alert-success">
<b>Success:</b> Parameter tuning was done in the right way.
</div>

<b>Model: RandomForestClassifier</b>

In [100]:
rf_model = RandomForestClassifier(random_state=12345)
rf_model.fit(features_train, target_train.values.ravel())
rf_accuracy = rf_model.score(features_val, target_val)*100
print('Random Forest Accuracy:', rf_accuracy.round(1), '%')


Random Forest Accuracy: 77.1 %


In [101]:
#Tune Max Depth to get a better performance on the RandomForestClassifier model
for i in range(1,21):
    rf_model = RandomForestClassifier(max_depth=i, random_state=12345)
    rf_model.fit(features_train, target_train.values.ravel())
    print('Max Depth ' + str(i)+': ' + str((rf_model.score(features_val, target_val)*100).round(1))+'%')

Max Depth 1: 75.3%
Max Depth 2: 76.7%
Max Depth 3: 78.4%
Max Depth 4: 79.0%
Max Depth 5: 79.0%
Max Depth 6: 79.2%
Max Depth 7: 79.3%
Max Depth 8: 79.2%
Max Depth 9: 78.1%
Max Depth 10: 79.0%
Max Depth 11: 77.8%
Max Depth 12: 78.5%
Max Depth 13: 77.4%
Max Depth 14: 78.1%
Max Depth 15: 76.0%
Max Depth 16: 78.2%
Max Depth 17: 77.0%
Max Depth 18: 77.4%
Max Depth 19: 77.4%
Max Depth 20: 77.4%


In [102]:
#Tune n_estimators to get a better performance on the RandomForestClassifier model
for i in range(20, 500, 20):
    rf_model = RandomForestClassifier(random_state=12345, n_estimators=i, max_depth=7)
    rf_model.fit(features_train, target_train.values.ravel())
    print('Number Estimators ' + str(i)+': ' + str((rf_model.score(features_val, target_val)*100).round(2))+'%')

Number Estimators 20: 80.4%
Number Estimators 40: 80.4%
Number Estimators 60: 80.25%
Number Estimators 80: 80.56%
Number Estimators 100: 80.87%
Number Estimators 120: 80.56%
Number Estimators 140: 80.87%
Number Estimators 160: 80.72%
Number Estimators 180: 80.87%
Number Estimators 200: 80.4%
Number Estimators 220: 80.4%
Number Estimators 240: 80.25%
Number Estimators 260: 80.25%
Number Estimators 280: 80.09%
Number Estimators 300: 80.09%
Number Estimators 320: 80.09%
Number Estimators 340: 80.25%
Number Estimators 360: 80.25%
Number Estimators 380: 80.25%
Number Estimators 400: 80.25%
Number Estimators 420: 80.4%
Number Estimators 440: 80.25%
Number Estimators 460: 80.56%
Number Estimators 480: 80.56%


<b>Conclusion</b>
<p>The RandomForestClassifier gave a higher accuracy (77.1%) than the DecisionTreeClassifier (72%) without any changes in the hyperparameters. Therefore, I chose this model over the DecisionTreeClassifier to achieve the greatest accuracy.</p>
<p>By tuning the max_depth hyperparameter, the RandomForestClassifier reaches its highest accuracy, 79.3%, at a max depth of 7.</p>
<p>By tuning the n_estimators hyperparameter with max_depth fixed at 7, the RandomForestClassifier reaches its highest accuracy, 80.87%, when the number of estimators equals 100, 140, or 180.</p>
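<p>The two hyperparameters above were tuned one after the other. A possible alternative is to search them jointly, since the best depth can depend on the number of trees. The sketch below uses synthetic data and a deliberately small grid (both are assumptions chosen to keep the example fast), scoring every combination on a validation set.</p>

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Small illustrative grid; the values are placeholders, not recommendations
depths = [3, 5, 7, 9]
estimator_counts = [20, 60, 100]

results = {}
for depth, n_trees in product(depths, estimator_counts):
    forest = RandomForestClassifier(
        max_depth=depth, n_estimators=n_trees, random_state=12345)
    forest.fit(X_train, y_train)
    results[(depth, n_trees)] = forest.score(X_val, y_val)

# Pick the combination with the highest validation accuracy
best_params = max(results, key=results.get)
print("Best (max_depth, n_estimators):", best_params,
      "accuracy:", round(results[best_params], 3))
```

<p>Differences of a few tenths of a percent on a validation set of about 640 rows correspond to only a handful of observations, so near-ties between combinations should not be over-interpreted.</p>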

<b>Model: LogisticRegression</b>

In [103]:
lr_model = LogisticRegression()
lr_model.fit(features_train, target_train.values.ravel())
lr_accuracy = lr_model.score(features_val, target_val)*100
print('Logistic Regression Accuracy:', lr_accuracy.round(1), '%')

Logistic Regression Accuracy: 72.2 %


In [104]:
#Tune solver to get a better performance on the LogisticRegression model
lr_model = LogisticRegression(solver='lbfgs')
lr_model.fit(features_train, target_train.values.ravel())
solver = lr_model.score(features_val, target_val)*100
print('Solver ', solver.round(1), '%')

Solver  72.0 %


<b>Conclusion</b>
<p>The LogisticRegression accuracy (72.2% by default, 72.0% with the lbfgs solver) is lower than that of the RandomForestClassifier (up to 80.87%), even after tuning the solver hyperparameter.</p>
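<p>Changing the solver mostly changes how the same model is optimized, so a small effect is expected (in scikit-learn 0.22 and later, <code>lbfgs</code> is already the default). Two things that may matter more for logistic regression are feature scaling, since <code>calls</code> and <code>mb_used</code> differ by orders of magnitude, and the regularization strength <code>C</code>. The sketch below is an assumption-laden illustration on synthetic data, not a result for the Megaline data: it rescales the synthetic columns to mimic that spread and compares a few values of <code>C</code> inside a pipeline.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
# Put the columns on very different scales, as in the real features
X = X * np.array([1.0, 10.0, 100.0, 10000.0])
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)

scores = {}
for c in [0.01, 0.1, 1, 10]:
    # The scaler is fitted on the training data only, inside the pipeline
    pipe = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c, solver="lbfgs", max_iter=1000),
    )
    pipe.fit(X_train, y_train)
    scores[c] = pipe.score(X_val, y_val)

for c, score in scores.items():
    print(f"C={c}: validation accuracy {score:.3f}")
```

<p>Whether this closes the gap to the random forest on the real data would need to be checked; a linear decision boundary may simply be too restrictive for this problem.</p>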

<div class="alert alert-block alert-success">
<b>Success:</b> Great that you've tried several models.
</div>

<b>Step 4. Check the quality of the model using the test set<a name="step4"></a></b>

In [105]:
#Quality of the DecisionTreeClassifier using the test set
model = DecisionTreeClassifier(max_depth=7, random_state=12345)
model.fit(features_train, target_train)
print('Accuracy:', (model.score(features_test, target_test).round(2)*100), '%')

Accuracy: 79.0 %


In [106]:
#Quality of the RandomForestClassifier using the test set
rf_model = RandomForestClassifier(random_state=12345, n_estimators=140, max_depth=7)
rf_model.fit(features_train, target_train.values.ravel())
print('Accuracy:', (rf_model.score(features_test, target_test).round(2)*100), '%')

Accuracy: 81.0 %


In [107]:
#Quality of the LogisticRegression using the test set
lr_model = LogisticRegression()
lr_model.fit(features_train, target_train.values.ravel())
print('Accuracy:', (lr_model.score(features_test, target_test).round(2)*100), '%')

Accuracy: 70.0 %


<b>Conclusion</b>
<p>On the test set, the RandomForestClassifier with max_depth of 7 and n_estimators of 140 returns the highest accuracy, 81%, compared with 79% for the DecisionTreeClassifier and 70% for the LogisticRegression model. The test set is similar to the real world because it is data that hasn't been used in building the model.</p>
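<p>Since the classes are imbalanced, accuracy alone does not show how well each plan is recognized. As a complement, the sketch below reports per-class recall and the confusion matrix on a held-out test set. It runs on synthetic data with a roughly 70/30 class mix (an assumption), so the numbers it prints say nothing about the Megaline data; in the notebook, <code>rf_model</code>, <code>features_test</code>, and <code>target_test</code> would be used instead.</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a roughly 70/30 class mix (assumption)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Same hyperparameters as the model chosen above
forest = RandomForestClassifier(n_estimators=140, max_depth=7, random_state=12345)
forest.fit(X_train, y_train)
pred = forest.predict(X_test)

# Rows of the matrix are true classes, columns are predicted classes
cm = confusion_matrix(y_test, pred)
print("Accuracy:", round(accuracy_score(y_test, pred), 3))
print("Recall for class 0:", round(recall_score(y_test, pred, pos_label=0), 3))
print("Recall for class 1:", round(recall_score(y_test, pred, pos_label=1), 3))
print(cm)
```

<p>If recall for class 1 turns out much lower than for class 0, the model is missing many Ultra candidates even when overall accuracy looks good.</p>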

<div class="alert alert-block alert-success">
<b>Success:</b> Testing was done absolutely right!
</div>

<b>Step 5. Sanity check the model<a name="step5"></a></b>

In [138]:
#Dummy classifier to predict everything as zeros
sanity = target_test * 0

In [150]:
print('Sanity Test:', f1_score(target_test, sanity))

Sanity Test: 0.0


In [141]:
confusion_matrix(target_test, sanity)

array([[455,   0],
       [188,   0]])

In [151]:
print('Accuracy:', (accuracy_score(target_test, sanity)*100).round(), '%')

Accuracy: 71.0 %


<b>Conclusion</b>
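<p>The F1 score of the sanity test is 0 because the baseline predicts 0 for every observation, so it never produces a true positive. The confusion matrix shows that it correctly labeled the 455 observations whose true class is 0, but mislabeled the 188 observations whose true class is 1 as 0. This gives an accuracy of 455/643, or about 71%. The RandomForestClassifier's test accuracy of 81% is well above this baseline.</p>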

<div class="alert alert-block alert-success">
<b>Success:</b> Great!
</div>

<div class="alert alert-block alert-info">
<b>Improve: </b> Next time you could use <a href="https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html"> DummyClassifier </a>.
</div>
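<p>A minimal sketch of the suggested <code>DummyClassifier</code> baseline follows. To stay self-contained it rebuilds the test labels from the confusion matrix above (455 zeros and 188 ones) and, as a shortcut, fits on those same labels; in the notebook the baseline would be fitted on <code>features_train</code> and <code>target_train</code> and scored on the test set. The placeholder feature column is an assumption, which is harmless here because this strategy ignores the features.</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Test-set class counts taken from the confusion matrix above:
# 455 Smart (0) and 188 Ultra (1)
y_test = np.array([0] * 455 + [1] * 188)
# The most_frequent strategy ignores the features, so a placeholder column is enough
X_placeholder = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_test)
baseline_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_test, baseline_pred)
# zero_division=0 avoids a warning when no positive class is predicted
baseline_f1 = f1_score(y_test, baseline_pred, zero_division=0)

print("Baseline accuracy:", round(baseline_accuracy, 3))
print("Baseline F1:", baseline_f1)
```

<p>This reproduces the manual all-zeros check: accuracy of about 0.708 and an F1 score of 0.</p>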

<b>Overall Conclusion<a name="conclusion"></a></b>
<p>There are various machine learning models that can be used to predict a particular outcome. For Megaline, the best model to use is the RandomForestClassifier (max_depth of 7, n_estimators of 140) because it produces the highest test accuracy, 81%, which is above both the 0.75 threshold and the 71% all-zeros baseline. It predicts the right plan for customers based on their usage: calls, minutes, messages, and mb_used.</p>

<div class="alert alert-block alert-info">
<b>Improve: </b> It would be better if the overall conclusion were more detailed.
</div>