## To the reviewer
Hey, my name is Yoad :)<br>
I would like to thank you for taking the time to go over my project.<br>
I hope that you'll find it interesting and clear, and I'm looking forward to reading your comments and suggestions.

----

# Plan Recommendation Model For Megaline's Subscribers

Megaline has found that many of its subscribers still use legacy plans.<br>
We need to develop a model, trained on the behavior of subscribers who have already switched to the new plans, that analyzes the behavior of legacy-plan subscribers and recommends one of the new plans (Smart or Ultra) to them.

## Description of the data
##### The `users_behavior` table:
 * `calls` — Number of calls.
 * `minutes` — Total call duration in minutes.
 * `messages` — Number of text messages.
 * `mb_used` — Internet traffic used in MB.
 * `is_ultra` — Plan for the current month (Ultra - 1, Smart - 0).

----

## Table of Contents 
<a class="anchor" id="table_of_contents"></a><br>
- [**First Look at The Data**](#chapter1)
  - [Users Behavior](#chapter2)
- [**Data Preprocessing**](#chapter5)
  - [Duplicated Values](#chapter7)
- [**Data Modeling**](#chapter9)
  - [Data Splitting](#chapter10)
  - [Investigate The Quality of Different Models](#chapter11)
      - [Decision Tree](#chapter12)
      - [Random Forest](#chapter13)
      - [Logistic Regression](#chapter14)
      - [Conclusion](#chapter17)
  - [Checking Quality Using The Test Data](#chapter15)
      - [Conclusion](#chapter18)
  - [Model Sanity Check](#chapter16)
  - [Conclusion](#chapter19)

----

## First Look at The Data 
<a class="anchor" id="chapter1"></a>
[Go back to the Table of Contents](#table_of_contents)

### Please Note: 
**There are some libraries that are not updated on the platform.<br>
In order for the code to run properly, remove the # sign in the following commands and execute them, then restart the kernel.<br>
Thank you.**

In [1]:
# !pip install pandas --upgrade

In [2]:
# !pip install --upgrade scikit-learn

In [3]:
# Importing the pandas library.
import pandas as pd

# Importing models.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

# Importing metric and splitting libraries.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Importing joblib.
from joblib import dump

In [4]:
# Using try-except lets the data load both locally and on the platform.

try:
    # Loading the data from the .csv file (local path).
    df = pd.read_csv('users_behavior.csv')

except FileNotFoundError:
    # Falling back to the platform path when the local file is missing.
    df = pd.read_csv('/datasets/users_behavior.csv')

### Users Behavior
<a class="anchor" id="chapter2"></a>
[Go back to the Table of Contents](#table_of_contents)

Our first step is to examine the whole data set: we check for missing values and confirm the data types. We already know the data was preprocessed, so this is only a check that everything is as expected.

In [5]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


- No missing values in the data.
- We won't change any data types, as it might conflict with our model training.
- Memory usage is already low (125.7 KB), so there is no need to optimize it.

Printing 5 rows randomly from the dataset to examine the data.

In [6]:
df.sample(5)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 1321 | 56.0 | 302.74 | 70.0 | 29223.17 | 0 |
| 2844 | 21.0 | 188.44 | 9.0 | 7414.17 | 0 |
| 645 | 84.0 | 572.48 | 5.0 | 28775.89 | 1 |
| 854 | 94.0 | 728.12 | 41.0 | 17962.51 | 0 |
| 1725 | 85.0 | 645.72 | 0.0 | 14586.92 | 0 |


Using describe() we can now get even more detailed information about our data.

In [7]:
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


- No weird or negative values here.
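
Because `is_ultra` only holds 0 and 1, its mean in the table above is also the share of Ultra rows. A minimal sketch of that arithmetic, using only the numbers printed by `describe()` (the variable names are illustrative):

```python
# Values copied from the describe() output above.
n_rows = 3214
is_ultra_mean = 0.306472  # mean of a 0/1 column = share of rows equal to 1

ultra_share = is_ultra_mean
smart_share = 1 - is_ultra_mean
ultra_rows = round(n_rows * is_ultra_mean)

print(f"Ultra: {ultra_share:.1%} (~{ultra_rows} rows)")
print(f"Smart: {smart_share:.1%} (~{n_rows - ultra_rows} rows)")
```

Roughly 69% of the rows are Smart, so a model has to beat about 69% accuracy to do better than always answering Smart.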

### Duplicated Values
<a class="anchor" id="chapter7"></a>
[Go back to the Table of Contents](#table_of_contents)

Looking for duplicated rows in the data set.

In [8]:
print('Number of duplicated rows in logs:', df.duplicated().sum())

Number of duplicated rows in logs: 0


#### Conclusion

**From a first look at the data, we conclude that:**
- There are no missing values.
- No duplicates in our data.
- There is no need to optimize the data.
- We don't observe any weird values in our data.
- **Our target is: is_ultra column**
- **Our features are all the rest of the columns.**

----

## Data Modeling
<a class="anchor" id="chapter9"></a>
[Go back to the Table of Contents](#table_of_contents)

### Data Splitting
<a class="anchor" id="chapter10"></a>
[Go back to the Table of Contents](#table_of_contents)

Because we only have one data set, we need to split it into training, validation, and test data so that we can tune our models and then test them on data they haven't seen.<br>
We'll split the data in a 3:1:1 ratio.

Splitting the data into the first two parts:
- df_train: 60% of the data.
- df_valid_test: 40% of the data.

In [9]:
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)

Now we'll split df_valid_test (40% of the original) into two parts:
- df_valid: 50% of df_valid_test.
- df_test: 50% of df_valid_test.

In [10]:
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)

Now we've created 3 different data sets:
- **train** data is **60%** of the original.
- **validation** data is **20%** of the original.
- **test** data is **20%** of the original.
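
The 3:1:1 ratio follows from chaining the two calls: 40% is held out first, then that hold-out is halved. The sketch below checks the arithmetic on a stand-in frame with the same number of rows as `users_behavior` (the frame `demo` and its single column are placeholders, not the real data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same row count as the real dataset.
demo = pd.DataFrame({"x": range(3214)})

# Step 1: hold out 40% of the rows.
demo_train, demo_rest = train_test_split(demo, test_size=0.4, random_state=12345)
# Step 2: split the hold-out in half -> 20% validation, 20% test.
demo_valid, demo_test = train_test_split(demo_rest, test_size=0.5, random_state=12345)

for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(demo):.1%})")
```

Since the target is imbalanced, passing `stratify=` to `train_test_split` is one option for keeping the class ratio the same in every part; the notebook does not do this.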

##### df_train

In [11]:
print('The length of df_train is:',len(df_train))

The length of df_train is: 1928


In [12]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1928 entries, 3027 to 482
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     1928 non-null   float64
 1   minutes   1928 non-null   float64
 2   messages  1928 non-null   float64
 3   mb_used   1928 non-null   float64
 4   is_ultra  1928 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 90.4 KB


##### df_valid

In [13]:
print('The length of df_valid is:',len(df_valid))

The length of df_valid is: 643


In [14]:
df_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1386 to 3197
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
 4   is_ultra  643 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 30.1 KB


##### df_test

In [15]:
print('The length of df_test is:',len(df_test))

The length of df_test is: 643


In [16]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 160 to 2313
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
 4   is_ultra  643 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 30.1 KB


#### Conclusion
- We've created 3 data sets from the original data set.
- Train data is 60% of the original.
- Validation data is 20% of the original.
- Test data is 20% of the original.

### Investigate The Quality of Different Models
<a class="anchor" id="chapter11"></a>
[Go back to the Table of Contents](#table_of_contents)

We will analyze the different model options at our disposal to find the best fit for our data.<br>
Each target and feature in the data sets will be assigned a variable.

In [17]:
train_features = df_train.drop(['is_ultra'], axis=1)
train_target = df_train['is_ultra']
valid_features = df_valid.drop(['is_ultra'], axis=1)
valid_target = df_valid['is_ultra']
test_features = df_test.drop(['is_ultra'], axis=1)
test_target = df_test['is_ultra']

#### Decision Tree
<a class="anchor" id="chapter12"></a>
[Go back to the Table of Contents](#table_of_contents)

Let's check a decision tree model on our train and validation data to find the best fit.<br>

Because we train the model on one part of the data and validate it on another part it hasn't seen, the validation score tells us how well the trained model generalizes, so the highest validation score points to the best setting.<br>

In this way, we can determine the best hyperparameter tweaking for our model.

- Our model will be created using a function, which will get its hyperparameters from the iteration block below.

In [18]:
# A function to create the model with the right hyper parameters for each iteration.
def decision_tree_creation(depth):
    model = DecisionTreeClassifier(random_state = 12345, max_depth = depth)
    return model.fit(train_features, train_target)

- We will iterate over different tree depths and choose the best.
- We use a for loop that trains a model for each depth from 1 to 20, records the training and validation accuracy, and collects the results in a dataframe.

In [19]:
# Iterating over different depths of the model and collecting the scores in a dataframe.
accuracy = []
for depth in range(1, 21):
    model = decision_tree_creation(depth)
    train_score = model.score(train_features, train_target)
    valid_score = model.score(valid_features, valid_target)

    accuracy.append(
        {
            'max_depth': depth,
            'train_accuracy': train_score * 100,
            'validation_accuracy': valid_score * 100,
        }
    )
accuracy = pd.DataFrame(accuracy)
accuracy.sort_values(by=['validation_accuracy'], ascending=False).head(3)

| | max_depth | train_accuracy | validation_accuracy |
|---|---|---|---|
| 2 | 3 | 80.757261 | 78.538103 |
| 5 | 6 | 83.76556 | 78.382582 |
| 6 | 7 | 85.580913 | 78.227061 |


##### Conclusion
- We've iterated over 20 different tree depths.
- **The most accurate max_depth is 3, with a validation accuracy of 78.54%**
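
The top-3 output also shows the gap between training and validation accuracy growing with depth, which is the usual sign of overfitting. A small sketch using only the three rows printed above (the `gap` column is added here for illustration):

```python
import pandas as pd

# The three rows shown in the output above.
top = pd.DataFrame(
    {
        "max_depth": [3, 6, 7],
        "train_accuracy": [80.757261, 83.765560, 85.580913],
        "validation_accuracy": [78.538103, 78.382582, 78.227061],
    }
)

# Gap between train and validation accuracy, in percentage points.
top["gap"] = top["train_accuracy"] - top["validation_accuracy"]

# Row with the highest validation accuracy.
best = top.loc[top["validation_accuracy"].idxmax()]
print(top)
print("Best max_depth:", int(best["max_depth"]))
```

Deeper trees keep improving on the training data while validation accuracy slips, so the shallow tree is the safer pick here.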

#### Random Forest
<a class="anchor" id="chapter13"></a>
[Go back to the Table of Contents](#table_of_contents)

In the same way as with the decision tree model, we'll iterate over different n_estimators values to find the most accurate one.

- Our model will be created using a function, which will get its hyperparameters from the iteration block below.

In [20]:
def random_forest_creation(n_est):
    model = RandomForestClassifier(random_state = 12345, n_estimators = n_est)
    return model.fit(train_features, train_target)

- We will iterate over different n_estimators values and choose the best.
- We use a for loop that trains a forest for each n_estimators value from 1 to 20, records the training and validation accuracy, and collects the results in a dataframe.

In [21]:
accuracy = []
for n_est in range(1, 21):
    model = random_forest_creation(n_est)
    train_score = model.score(train_features, train_target)
    valid_score = model.score(valid_features, valid_target)

    accuracy.append(
        {
            'n_estimators': n_est,
            'train_accuracy': train_score * 100,
            'validation_accuracy': valid_score * 100,
        }
    )
accuracy = pd.DataFrame(accuracy)
accuracy.sort_values(by=['validation_accuracy'], ascending=False).head(3)

| | n_estimators | train_accuracy | validation_accuracy |
|---|---|---|---|
| 17 | 18 | 98.755187 | 79.315708 |
| 18 | 19 | 99.118257 | 78.849145 |
| 19 | 20 | 98.910788 | 78.693624 |


##### Conclusion
- We've iterated over 20 different n_estimators values.
- **The most accurate n_estimators is 18, with a validation accuracy of 79.32%**
- The Random Forest score is slightly higher than the Decision Tree's, but a forest trains and predicts with many trees, so it is generally slower, and on larger data its training time might be significantly longer.
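
The speed difference is not measured in this notebook. One way to check it would be to time a single fit of each model, as in the rough sketch below. It runs on synthetic stand-in data from `make_classification` (an assumption, not `users_behavior`), and the timings will vary from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 features, about the size of the training split.
X, y = make_classification(
    n_samples=1928, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)


def fit_seconds(model):
    """Return the wall-clock time needed to fit the model once."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


tree_time = fit_seconds(DecisionTreeClassifier(max_depth=3, random_state=12345))
forest_time = fit_seconds(RandomForestClassifier(n_estimators=18, random_state=12345))

print(f"Decision tree: {tree_time:.4f} s")
print(f"Random forest: {forest_time:.4f} s")
```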

#### Logistic Regression
<a class="anchor" id="chapter14"></a>
[Go back to the Table of Contents](#table_of_contents)

With Logistic Regression, we don't have max_depth or n_estimators to iterate over, so here we'll just train the model and check its score.

In [22]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(train_features, train_target)
train_score = model.score(train_features, train_target)
valid_score = model.score(valid_features, valid_target)
print("Train Score: {:.1f}%".format(train_score * 100),",",
      "Validation Score: {:.1f}%".format(valid_score * 100)
     )

Train Score: 71.6% , Validation Score: 70.9%


##### Conclusion
- **The validation score (70.9%)** is close to **the training score (71.6%)**, which suggests that the model is not overfitting.
- However, it is the least accurate of the 3 models we tested.
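
The features are on very different scales (`mb_used` is in the tens of thousands while `messages` is in the tens), which can matter for a regularized linear model. One thing that might be worth trying, though the notebook does not do it, is standardizing the features first. The sketch below only shows the mechanics on synthetic stand-in data; it makes no claim about the resulting score on `users_behavior`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four usage features (not the real data).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X[:, 3] *= 10_000  # mimic one feature on a much larger scale

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training split only, then reused for validation.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="liblinear"),
)
scaled_model.fit(X_train, y_train)

valid_accuracy = scaled_model.score(X_valid, y_valid)
print(f"Validation accuracy: {valid_accuracy:.3f}")
```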

#### Conclusion
<a class="anchor" id="chapter17"></a>
[Go back to the Table of Contents](#table_of_contents)

We checked 3 different classification models, and got the following results:
- **Decision Tree**: the most accurate setting was **max_depth = 3**, with **accuracy of 78.5%** on the validation data.
- **Random Forest**: the most accurate setting was **n_estimators = 18**, with **accuracy of 79.3%** on the validation data.
- **Logistic Regression**: reached **accuracy of 70.9%** on the validation data.
<br>


- **Random Forest is the most accurate model** of the 3 we tested, but the difference between Random Forest and Decision Tree is small, and with a large data set Random Forest might be much slower than Decision Tree.
- **Decision Tree is the 2nd most accurate**, and **not by a large margin**.
- **Logistic Regression was the least accurate** of the 3 models we tested.

### Checking Quality Using The Test Data
<a class="anchor" id="chapter15"></a>
[Go back to the Table of Contents](#table_of_contents)

After we have tweaked our hyperparameters, we can test our models against our test data and see if they are successful.

#### Decision Tree
<a class="anchor" id="chapter12"></a>
[Go back to the Table of Contents](#table_of_contents)

In [23]:
model = DecisionTreeClassifier(random_state = 12345, max_depth = 3)
model.fit(train_features, train_target)
train_score = model.score(train_features, train_target)
test_score = model.score(test_features, test_target)
print("Train Score: {:.1f}%".format(train_score * 100),",",
      "Test Score: {:.1f}%".format(test_score * 100))

Train Score: 80.8% , Test Score: 77.9%


##### Conclusion
- We used max_depth = 3 as it was the most accurate for our data.
- **The test score is 77.9%**
- We didn't lose much accuracy on the test data compared with validation (78.5%), which is a good sign that the model generalizes to new data.
- And we are over our threshold of 75%!

#### Random Forest
<a class="anchor" id="chapter13"></a>
[Go back to the Table of Contents](#table_of_contents)

In [24]:
model = RandomForestClassifier(random_state = 12345, n_estimators = 18)
model.fit(train_features, train_target)
train_score = model.score(train_features, train_target)
test_score = model.score(test_features, test_target)
print("Train Score: {:.1f}%".format(train_score * 100),",",
      "Test Score: {:.1f}%".format(test_score * 100))

Train Score: 98.8% , Test Score: 78.5%


##### Conclusion
- We used n_estimators = 18 as it was the most accurate for our data.
- **The test score is 78.5%**
- We didn't lose much accuracy on the test data compared with validation (79.3%), which is a good sign that the model generalizes to new data.
- And we are over our threshold of 75%!

#### Logistic Regression
<a class="anchor" id="chapter14"></a>
[Go back to the Table of Contents](#table_of_contents)

In [25]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(train_features, train_target)
train_score = model.score(train_features, train_target)
test_score = model.score(test_features, test_target)
print("Train Score: {:.3f}%".format(train_score * 100),",",
      "Test Score: {:.3f}%".format(test_score * 100)
     )

Train Score: 71.577% , Test Score: 68.896%


##### Conclusion
- **The test score is 68.896%**
- Unfortunately, this model doesn't meet our 75% success threshold, so we cannot use it any further.

#### Conclusion
<a class="anchor" id="chapter18"></a>
[Go back to the Table of Contents](#table_of_contents)

- We took the most accurate hyperparameter settings for each model and tested each one against our test data.
- All of our models lost some accuracy compared with validation.
- **The most accurate model is Random Forest with a score of 78.5%.**
- **Then** we have **Decision Tree with a score of 77.9%.**
- **Logistic Regression is out of the picture** as **it does not meet the 75% success threshold.**
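
The threshold check can be written down directly. The sketch below uses the test accuracies printed by cells 23 to 25 and the 75% threshold mentioned above; the dictionary and variable names are illustrative.

```python
# Test accuracies (in %) as printed by cells 23-25 above.
test_scores = {
    "Random Forest": 78.5,
    "Decision Tree": 77.9,
    "Logistic Regression": 68.896,
}
THRESHOLD = 75.0  # success threshold stated in the notebook

# Report the models from most to least accurate.
for name, score in sorted(test_scores.items(), key=lambda item: item[1], reverse=True):
    verdict = "passes" if score >= THRESHOLD else "fails"
    print(f"{name}: {score:.1f}% {verdict} the {THRESHOLD:.0f}% threshold")

# Models that can still be considered.
passing = [name for name, score in test_scores.items() if score >= THRESHOLD]
```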

### Model Sanity Check
<a class="anchor" id="chapter16"></a>
[Go back to the Table of Contents](#table_of_contents)

To be sure that our model works, we need to check it against chance, or against a dummy model that will also classify the data and predict the most frequent label. <br>
A lower model score than the dummy score indicates that our model is irrelevant.

In [26]:
dummy_model = DummyClassifier(random_state=12345, strategy="most_frequent")
dummy_model.fit(train_features, train_target)
dummy_score = dummy_model.score(train_features, train_target)
print("Dummy Model Score: {:.3f}%".format(dummy_score * 100))

Dummy Model Score: 69.243%


##### Conclusion
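- The dummy model scores 69.2% (measured here on the training data). Decision Tree (77.9%) and Random Forest (78.5%) clearly beat it on the test data, so they do better than always predicting the most frequent plan.
- Logistic Regression's test score (68.9%) is no better than the dummy score, which confirms that it should be dropped.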
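
A `most_frequent` dummy always predicts the majority class of the data it was fitted on, so its accuracy on any split is simply that class's share of the split. The cell above scores the dummy on the training split; for a like-for-like comparison with the test scores, it could be scored on the test split instead. A sketch on synthetic labels (the arrays are placeholders, and no score for the real data is implied):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12345)
# Placeholder labels with about 30% positives; features are irrelevant to a dummy model.
y = (rng.random(3214) < 0.3).astype(int)
X = np.zeros((len(y), 1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# Accuracy on the held-out split equals the share of the training majority class there.
majority = np.bincount(y_train).argmax()
dummy_test_score = dummy.score(X_test, y_test)
majority_share = (y_test == majority).mean()

print(f"Dummy test accuracy: {dummy_test_score:.3f}")
print(f"Share of class {majority} in test: {majority_share:.3f}")
```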

----

## Conclusion
<a class="anchor" id="chapter19"></a>
[Go back to the Table of Contents](#table_of_contents)



Our study compared models and hyperparameter settings for recommending one of the new plans to Megaline's legacy-plan subscribers.

To achieve the highest prediction accuracy, we recommend a Random Forest model with 18 estimators.
Random Forest, while accurate, can also be slow.<br>
If speed is important, we can use the second most accurate model, Decision Tree: it was less than 1 percentage point behind on the test data, and a single shallow tree is generally faster to train and to predict with.
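
The notebook imports `dump` from `joblib` but never calls it. If the chosen model were to be saved for later use, it could look roughly like this; the file name and the synthetic training data are placeholders, not part of the project.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not users_behavior).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)

model = RandomForestClassifier(n_estimators=18, random_state=12345).fit(X, y)

# Save the fitted model to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "plan_model.joblib")  # hypothetical file name
    dump(model, path)
    restored = load(path)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same)
```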

----