<div class="alert alert-success">
<b>Reviewer's comment V3</b>
  
Thank you for taking the time to improve the project! Now it's accepted. Good luck on the next sprint!
    
</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job, but one task is not quite completed. Should be very straightforward to fix though!

# Project description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible *accuracy*. In this project, the threshold for *accuracy* is **0.75**. Check the *accuracy* using the test dataset. 

## Data description
Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
* `calls` — number of calls,
* `minutes` — total call duration in minutes,
* `messages` — number of text messages,
* `mb_used` — Internet traffic used in MB,
* `is_ultra` — plan for the current month (Ultra - 1, Smart - 0).

## Project instructions
### Open and look through the data file. 

In [19]:
pip install -U sidetable

Defaulting to user installation because normal site-packages is not writeable
Requirement already up-to-date: sidetable in /home/jovyan/.local/lib/python3.7/site-packages (0.9.0)
Note: you may need to restart the kernel to use updated packages.


#### Imports

In [20]:
import pandas as pd
import sidetable as stb
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [21]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.head()                 

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [22]:
df.describe()

| statistic | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


#### Conclusion
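In this section, we:
* looked at the first rows of the table
* looked at the summary statistics of each column
* checked the data types and confirmed that no column has missing values (3214 non-null entries each)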

<div class="alert alert-success">
<b>Reviewer's comment</b>
  
Alright, the data was loaded and inspected!
  
</div>

### Split the source data into a training set, a validation set, and a test set.

In [24]:
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)

In [25]:
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)

In [26]:
print('train size: ', round(df_train.shape[0] / df.shape[0], 2), 
      '\nvalidation size: ', round(df_valid.shape[0] / df.shape[0], 2), 
      '\ntest size: ', round(df_test.shape[0] / df.shape[0], 2))

train size:  0.6 
validation size:  0.2 
test size:  0.2


#### Conclusion
We split the dataset into three parts: a training set (60%), a validation set (20%), and a test set (20%).
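As a rough check of why `test_size=0.4` followed by `test_size=0.5` gives a 60/20/20 split, the arithmetic can be sketched in plain Python. The helper `two_step_split_sizes` is hypothetical, and the rounding rule (held-out part rounded up, remainder kept) is an assumption about how `train_test_split` rounds, so real counts could differ by a row.

```python
import math


def two_step_split_sizes(n_rows, first_test_size=0.4, second_test_size=0.5):
    """Return (train, valid, test) row counts for a two-step hold-out split."""
    # Assumption: the held-out part is rounded up and the rest is kept.
    holdout = math.ceil(n_rows * first_test_size)
    train = n_rows - holdout
    # The held-out part is then split again into validation and test.
    test = math.ceil(holdout * second_test_size)
    valid = holdout - test
    return train, valid, test


n_rows = 3214  # number of entries reported by df.info()
train, valid, test = two_step_split_sizes(n_rows)
for name, size in [("train", train), ("validation", valid), ("test", test)]:
    print(name, size, round(size / n_rows, 2))
```

Holding out 40% and then halving it leaves 0.4 * 0.5 = 0.2 of the rows each for validation and test.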

<div class="alert alert-success">
<b>Reviewer's comment</b>
  
The data was split into train, validation and test. The proportions are reasonable.
  
</div>

### Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

#### features and targets

In [27]:
train_features = df_train.drop(['is_ultra'], axis=1)
train_target = df_train['is_ultra']

valid_features = df_valid.drop(['is_ultra'], axis=1)
valid_target = df_valid['is_ultra']

test_features = df_test.drop(['is_ultra'], axis=1)
test_target = df_test['is_ultra']

#### Decision Tree

In [28]:
scores = []

for i in range(1, 11):
    decision_tree = DecisionTreeClassifier(random_state=12345, max_depth=i)
    decision_tree.fit(train_features, train_target)
    score = round(decision_tree.score(valid_features, valid_target),3)
    print('max_depth={}:'.format(i), score)
    scores.append(score)
max_score_max_depth = scores.index(max(scores)) + 1
print('\nThe best score is of max_depth={}:'.format(max_score_max_depth), max(scores))

max_depth=1: 0.754
max_depth=2: 0.782
max_depth=3: 0.785
max_depth=4: 0.779
max_depth=5: 0.779
max_depth=6: 0.784
max_depth=7: 0.782
max_depth=8: 0.779
max_depth=9: 0.782
max_depth=10: 0.774

The best score is of max_depth=3: 0.785


#### Random Forest

In [29]:
scores = []

for i in range(1, 11):
    random_forest = RandomForestClassifier(random_state=12345, n_estimators=i)
    random_forest.fit(train_features, train_target)
    score = round(random_forest.score(valid_features, valid_target),3)
    print('n_estimators={}:'.format(i), score)
    scores.append(score)

max_score_n_estimators = scores.index(max(scores)) + 1
print('\nThe best score is of n_estimators={}:'.format(max_score_n_estimators), max(scores))

n_estimators=1: 0.711
n_estimators=2: 0.764
n_estimators=3: 0.739
n_estimators=4: 0.771
n_estimators=5: 0.75
n_estimators=6: 0.781
n_estimators=7: 0.768
n_estimators=8: 0.782
n_estimators=9: 0.773
n_estimators=10: 0.785

The best score is of n_estimators=10: 0.785


#### Logistic Regression

In [30]:
logistic_regression = LogisticRegression(random_state=12345, solver='liblinear')
logistic_regression.fit(train_features, train_target)
logistic_regression.score(valid_features, valid_target)

0.7589424572317263

#### Conclusion
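We investigated the quality of different models on the validation set by changing hyperparameters:
* Decision tree (max_depth from 1 to 10): the best score is 0.785 at max_depth=3
* Random forest (n_estimators from 1 to 10): the best score is 0.785 at n_estimators=10
* Logistic regression (solver='liblinear', no hyperparameter search): 0.759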
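The selection step behind these numbers can also be written without the `scores.index(...) + 1` offset by keeping each score next to its hyperparameter value. This is only a sketch: the dictionaries below are copied by hand from the printed validation scores, and the variable names are illustrative.

```python
# Validation accuracies copied from the decision tree output (max_depth -> score).
depth_scores = {1: 0.754, 2: 0.782, 3: 0.785, 4: 0.779, 5: 0.779,
                6: 0.784, 7: 0.782, 8: 0.779, 9: 0.782, 10: 0.774}

# Validation accuracies copied from the random forest output (n_estimators -> score).
forest_scores = {1: 0.711, 2: 0.764, 3: 0.739, 4: 0.771, 5: 0.75,
                 6: 0.781, 7: 0.768, 8: 0.782, 9: 0.773, 10: 0.785}

# max() over the keys, ranked by score; on a tie the first key wins.
best_depth = max(depth_scores, key=depth_scores.get)
best_n_estimators = max(forest_scores, key=forest_scores.get)

print("best max_depth:", best_depth, depth_scores[best_depth])
print("best n_estimators:", best_n_estimators, forest_scores[best_n_estimators])
```

Both models reach the same rounded validation score, so the validation set alone does not separate them.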

<div class="alert alert-danger">
<s><b>Reviewer's comment</b></s>
  
<s>Great, you trained three different models. The task requires you to try at least two different sets of hyperparameters for at least one model though: `Investigate the quality of different models by changing hyperparameters`</s>
  
</div>

<div class="alert alert-danger">
<s><b>Reviewer's comment V2</b></s>
  
<s>Awesome, now you tried various hyperparameter values for the two models. One problem is that the models weren't retrained with the selected hyperparameters before being evaluated on the test set, so the last model from each loop is evaluated instead. For example, the tree with max_depth=10 is evaluated instead of the decision tree with max_depth=3.</s>
  
</div>
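The issue described in the comment above can be reproduced on synthetic data. The sketch below is an illustration under assumptions, not the project code: the data comes from `make_classification` rather than the Megaline file, and keeping the best fitted model inside the loop is one possible fix alongside retraining after the loop.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_model, best_depth, best_score = None, None, -1.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    # Keep the fitted model itself, not only its score.
    if score > best_score:
        best_model, best_depth, best_score = model, depth, score

# After the loop, `model` is always the depth-10 tree, whatever scored best.
print("last model in the loop: max_depth =", model.max_depth)
print("best model on validation: max_depth =", best_model.max_depth)
```

Evaluating `best_model` (or a model refitted with `best_depth`) on the test set avoids scoring whichever model the loop happened to end on.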

<div class="alert alert-success">
<b>Reviewer's comment V3</b>
  
Fixed!
    
</div>

### Check the quality of the model using the test set.

In [31]:
decision_tree = DecisionTreeClassifier(random_state=12345, max_depth=max_score_max_depth)
decision_tree.fit(train_features, train_target)

random_forest = RandomForestClassifier(random_state=12345, n_estimators=max_score_n_estimators)
random_forest.fit(train_features, train_target)

print('Decision tree score: ', round(decision_tree.score(test_features, test_target), 3),
     '\nRandom forest score: ', round(random_forest.score(test_features, test_target), 3),
     '\nLogistic regression score: ', round(logistic_regression.score(test_features, test_target), 3))

Decision tree score:  0.779 
Random forest score:  0.781 
Logistic regression score:  0.74


#### Conclusion
The accuracy scores of the same models on the test set, from best to worst:
1. Random forest: 0.781
2. Decision tree: 0.779
3. Logistic regression: 0.74
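To relate these scores to the project's accuracy threshold of 0.75, a short check can be written as below. It is a sketch: the scores are copied by hand from the output above, and the names `ACCURACY_THRESHOLD`, `test_scores`, and `passed` are illustrative.

```python
ACCURACY_THRESHOLD = 0.75  # threshold stated in the project description

# Test-set accuracies copied from the printed output.
test_scores = {
    "random forest": 0.781,
    "decision tree": 0.779,
    "logistic regression": 0.74,
}

# Mark each model as passing or failing the threshold.
passed = {name: score >= ACCURACY_THRESHOLD for name, score in test_scores.items()}

for name, ok in passed.items():
    print(name, test_scores[name], "meets threshold" if ok else "below threshold")
```

With these numbers, random forest and decision tree clear the threshold, while logistic regression falls just short of it.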

<div class="alert alert-success">
<s><b>Reviewer's comment</b></s>
  
<s>Ok, you evaluated the models on the test set!</s>
  
</div>

### Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

In [32]:
df.stb.freq(['is_ultra'], style=True)

| | is_ultra | count | percent | cumulative_count | cumulative_percent |
|---|---|---|---|---|---|
| 0 | 0 | 2229 | 69.35% | 2229 | 69.35% |
| 1 | 1 | 985 | 30.65% | 3214 | 100.00% |


The table above shows that the values of the `is_ultra` column are not evenly distributed: there are more than twice as many 0s as 1s. We can therefore use the most common value (0) as a constant prediction. Doing so gives the correct answer 69.35% of the time without any model.
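The baseline described here can be computed directly from the class counts in the table. The sketch below makes two assumptions: it uses the counts for the whole dataset, so the share of 0s in the test set alone may differ slightly, and it compares against test scores copied by hand from the earlier output.

```python
# Class counts copied from the frequency table (is_ultra -> count).
class_counts = {0: 2229, 1: 985}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the most common class.
baseline_accuracy = class_counts[majority_class] / total
print("majority class:", majority_class)
print("baseline accuracy:", round(baseline_accuracy, 4))

# Test-set accuracies copied from the printed output.
test_scores = {"decision tree": 0.779, "random forest": 0.781,
               "logistic regression": 0.74}
for name, score in test_scores.items():
    print(name, "beats baseline:", score > baseline_accuracy)
```

A model passes this sanity check only if its accuracy is higher than the constant prediction's accuracy.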

<div class="alert alert-warning">
<b>Reviewer's comment</b>
  
Yep, that's a very good observation! To sanity check the models we need to answer the question: are our models better than this baseline?
  
</div>

#### General Conclusion
This is a classification task, so we tried three models (decision tree, random forest, and logistic regression) to find the one with the highest accuracy.

We divided the data into three sets: training (60%), validation (20%), and test (20%).

We tuned the hyperparameters on the validation set and then measured the accuracy of each model on the test set.

By test accuracy, random forest is the best model (0.781), decision tree is second (0.779), and logistic regression is last (0.74). Only the first two reach the 0.75 threshold.

In addition, we did a sanity check. The classes are imbalanced: always predicting the most common plan (Smart, `is_ultra` = 0) is correct for 69.35% of the observations without any model. All three models score above this figure on the test set.

<div class="alert alert-success">
<b>Reviewer's comment</b>
  
Conclusions look good
  
</div>

# Project evaluation
We’ve put together the evaluation criteria for the project. Read this carefully before moving on to the task.

Here’s what the reviewers will look at when reviewing your project:
* How did you look into data after downloading?
* Have you correctly split the data into train, validation, and test sets?
* How have you chosen the sets' sizes?
* Did you evaluate the quality of the models correctly?
* What models and hyperparameters did you use?
* What are your findings?
* Did you test the models correctly?
* What is your accuracy score?
* Have you stuck to the project structure and kept the code neat?

You have your takeaway sheets and chapter summaries so you are ready to proceed to the project.

Good luck!