# **5G User Prediction with Machine Learning and Deep Learning Methods**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_sample = pd.read_csv('/kaggle/input/dataset/sample.csv')
df_train = pd.read_csv('/kaggle/input/dataset/train.csv')
df_test = pd.read_csv('/kaggle/input/dataset/test.csv')
df_train.shape, df_test.shape, df_sample.shape

In [None]:
df_train.head()

In [None]:
df_train.isnull().sum().sum(), df_test.isnull().sum().sum()

There are no missing values in either dataset, which is nice.

In [None]:
df_train.columns

In [None]:
df_test.columns

We are going to predict the ***is_5g*** column. The sample dataset holds the correct ***is_5g*** values for the test dataset.

Now, I will look for relationships between the features and 5G usage to see which categories matter most to the decision.

In [None]:
sns.set(rc={'figure.figsize':(25,6)})
sns.barplot(x="chnl_type", y="is_5g", hue='product_type', data=df_train)

In [None]:
sns.barplot(x="service_type", y="is_5g", hue='product_type', data=df_train)

In [None]:
sns.barplot(x="product_type", y="is_5g", data=df_train)

Channel, service, and product types are determining factors.
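The bar plots above show, for each category, the mean of ***is_5g***, which is seaborn's default estimator for `barplot`. The same numbers can be read as a table with a group-by. This is a minimal sketch on invented toy data; the values are not taken from the competition files.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "product_type": ["a", "a", "b", "b", "b"],
    "is_5g":        [1,   0,   1,   1,   1],
})

# Mean of the 0/1 target per category = share of 5G users in that category.
rate_by_product = toy.groupby("product_type")["is_5g"].mean()
print(rate_by_product)
```

On the real data, replacing `toy` with `df_train` would give the bar heights of the corresponding plot.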

In [None]:
sns.barplot(x="term_type", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="age", y="is_5g", hue='sex', data=df_train)

In [None]:
sns.barplot(x="manu_name", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="max_rat_flag", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="is_5g_base_cover", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="is_work_5g_cover", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="activity_type", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="is_act_expire", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="comp_type", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="city_5g_ratio", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="city_level", y="is_5g", data=df_train)

In [None]:
sns.barplot(x="prov_id", y="is_5g", data=df_train)

So far, I believe I have gathered enough information about the dataset. I have looked at every column and its effect on ***is_5g***.
In short, I am going to drop these columns: ***area_id, innet_months, total_times, total_flux, total_fee, pay_fee, age, activity_type, game_app_flux, live_app_flux, video_app_flux, bank_cnt, call_days, re_call10, short_call10, long_call10***, and also the ***active_days*** columns.

In [None]:
df = df_train.drop(['area_id', 'innet_months', 'total_times', 'total_flux', 'total_fee', 'pay_fee', 'age', 
                    'activity_type', 'game_app_flux', 'live_app_flux', 'video_app_flux', 'bank_cnt', 'call_days',
                   're_call10', 'short_call10', 'long_call10'], axis=1)

In [None]:
df_wo_days = df.drop(['active_days01', 'active_days02', 'active_days03', 'active_days04', 'active_days05',
                      'active_days06', 'active_days07', 'active_days08', 'active_days09', 'active_days10',
                      'active_days11', 'active_days12', 'active_days13', 'active_days14', 'active_days15',
                      'active_days16', 'active_days17', 'active_days18','active_days19', 'active_days20',
                      'active_days21', 'active_days22', 'active_days23'], axis=1)
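Since the day columns all share the `active_days` prefix, the list could also be built rather than typed out. The helper below is a hypothetical sketch (its name and the example column list are assumptions, not part of the notebook).

```python
def columns_with_prefix(columns, prefix):
    """Return the column names that start with `prefix`, in their original order."""
    return [name for name in columns if name.startswith(prefix)]

# Hypothetical column list standing in for df.columns.
example_columns = ["user_id", "active_days01", "active_days02", "is_5g"]
day_columns = columns_with_prefix(example_columns, "active_days")
print(day_columns)
```

With a DataFrame, the result could be passed straight to `df.drop(columns=columns_with_prefix(df.columns, "active_days"))`, which would also keep the train and test preprocessing in sync.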

Now, our data looks like this.

In [None]:
df_wo_days.head()

Before we start with the predictions, we need to prepare our test dataset as well.

I will drop user_id before the training process because its data type is hard to work with and I do not want to wrestle with it. I will add that column back later on.

In [None]:
user_id = df_test['user_id']
df_wo_days = df_wo_days.drop(['user_id'], axis=1)

In [None]:
df_test = df_test.drop(['area_id', 'innet_months', 'total_times', 'total_flux', 'total_fee', 'pay_fee', 'age', 
                    'activity_type', 'game_app_flux', 'live_app_flux', 'video_app_flux', 'bank_cnt', 'call_days',
                   're_call10', 'short_call10', 'long_call10','active_days01', 'active_days02', 'active_days03', 
                        'active_days04', 'active_days05','active_days06', 'active_days07', 'active_days08',
                        'active_days09', 'active_days10','active_days11', 'active_days12', 'active_days13', 
                        'active_days14', 'active_days15','active_days16', 'active_days17', 'active_days18',
                        'active_days19', 'active_days20','active_days21', 'active_days22', 'active_days23'], axis=1)

Starting now...

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [None]:
x = df_wo_days.drop(['is_5g'], axis=1)
y = df_wo_days['is_5g']
df_wo_days.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)

**LOGISTIC REGRESSION**

In [None]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression()
logistic.fit(x_train, y_train)
prediction_lr = logistic.predict(x_test)
print(classification_report(y_test,prediction_lr))

**DECISION TREE**

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
prediction_dt = tree.predict(x_test)
print(classification_report(y_test, prediction_dt))

* Precision: 99%
* Recall: 99%
* Accuracy: 98%
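For reference, here is how precision, recall, and accuracy for the positive class follow from the counts of true/false positives and negatives. This is a plain-Python sketch on invented labels; `classification_report` computes the same quantities for each class.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, accuracy) for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Invented labels for illustration only.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, accuracy = binary_metrics(y_true_toy, y_pred_toy)
print(precision, recall, accuracy)
```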

**RANDOM FOREST**

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(x_train, y_train)
prediction_rf = forest.predict(x_test)
print(classification_report(y_test, prediction_rf))

* Precision: 99%
* Recall: 100%
* Accuracy: 99%

**XGBOOST**

In [None]:
import xgboost
xgb = xgboost.XGBClassifier()
xgb.fit(x_train,y_train)
prediction_xgb = xgb.predict(x_test)
print(classification_report(y_test, prediction_xgb))

* Precision: 99%
* Recall: 100%
* Accuracy: 99%

**NEURAL NETWORK**

In [None]:
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.layers import Dropout
df_wo_days.shape

In [None]:
model = Sequential([
    # Size the input layer from the training features instead of hard-coding it.
    Dense(32, activation='relu', input_dim=x_train.shape[1]),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.summary()
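The model above is compiled with mean squared error. For a sigmoid output on a 0/1 target, binary cross-entropy (`'binary_crossentropy'` in Keras) is the more common choice, because it penalizes confidently wrong predictions much more heavily. The comparison below is only an illustrative sketch with an invented prediction value; it does not say which loss works better on this dataset.

```python
import math

def squared_error(y_true, p):
    """Squared error for a single prediction p in (0, 1)."""
    return (y_true - p) ** 2

def binary_cross_entropy(y_true, p):
    """Binary cross-entropy for a single prediction p in (0, 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A confidently wrong prediction: the true label is 1, the model outputs 0.01.
mse_loss = squared_error(1, 0.01)
bce_loss = binary_cross_entropy(1, 0.01)
print(mse_loss, bce_loss)
```

The squared error is bounded by 1 for a single sample, while the cross-entropy grows without bound as the prediction approaches the wrong extreme.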

In [None]:
model.fit(x_train, y_train, batch_size=32, epochs=10,verbose=2)

In [None]:
# model.predict returns probabilities with shape (n, 1); flatten, then threshold at 0.5.
prediction_nn = (model.predict(x_test).ravel() >= 0.5).astype(int)
print(classification_report(y_test, prediction_nn))

* Precision: 99%
* Recall: 100%
* Accuracy: 99%

**Predictions with the validation set look great. Now we will look at the test data.**

In [None]:
df_wo_days.shape, df_test.shape

In [None]:
x.shape, y.shape, df_test.shape

In [None]:
df_test = df_test.drop(['user_id'], axis=1)
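Before predicting, it can be worth confirming that the test frame has the same feature columns, in the same order, as the training features. The sketch below uses invented toy frames; `x_toy` and `test_toy` are hypothetical stand-ins for `x` and `df_test`.

```python
import pandas as pd

# Toy frames, invented for illustration only.
x_toy = pd.DataFrame({"sex": [0, 1], "city_level": [2, 3]})
test_toy = pd.DataFrame({"city_level": [1, 2], "sex": [1, 0]})

# Fail early if the test set lacks any training feature.
missing = set(x_toy.columns) - set(test_toy.columns)
assert not missing, f"test set is missing columns: {missing}"

# Reorder the test columns to match the training features.
test_aligned = test_toy[x_toy.columns]
print(list(test_aligned.columns))
```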

In [None]:
logistic.fit(x, y)
test_lr = logistic.predict(df_test)

In [None]:
forest.fit(x, y)
test_rf = forest.predict(df_test)

In [None]:
tree.fit(x, y)
test_dt = tree.predict(df_test)

In [None]:
xgb.fit(x,y)
test_xgb = xgb.predict(df_test)

In [None]:
test_lr.shape, test_rf.shape, test_dt.shape, test_xgb.shape

In [None]:
dt_pred = pd.DataFrame(test_dt, columns= ['5G-DT'])
rf_pred = pd.DataFrame(test_rf, columns= ['5G-RF'])
lr_pred = pd.DataFrame(test_lr, columns= ['5G-LR'])
xgb_pred = pd.DataFrame(test_xgb, columns= ['5G-XGB'])

In [None]:
dt_pred.reset_index(inplace=True, drop=True)
rf_pred.reset_index(inplace=True, drop=True)
lr_pred.reset_index(inplace=True, drop=True)
xgb_pred.reset_index(inplace=True, drop=True)

In [None]:
df_fin = pd.concat([user_id, lr_pred, rf_pred, dt_pred, xgb_pred], axis=1)
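`pd.concat` with `axis=1` lines rows up by index label, not by position, which is why the indexes are reset before concatenating. A small sketch with invented values shows what happens when the indexes differ.

```python
import pandas as pd

# Invented values for illustration only.
ids = pd.Series(["u1", "u2", "u3"], name="user_id")           # index 0, 1, 2
preds = pd.DataFrame({"5G-DT": [1, 0, 1]}, index=[5, 6, 7])   # mismatched index

# Mismatched labels produce the union of both indexes, padded with NaN.
misaligned = pd.concat([ids, preds], axis=1)

# Resetting the index restores positional alignment.
aligned = pd.concat([ids, preds.reset_index(drop=True)], axis=1)
print(misaligned.shape, aligned.shape)
```

In this notebook the prediction frames are built from NumPy arrays and so already carry a default index; the reset matters mainly when a frame has been filtered or shuffled beforehand.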

In [None]:
df_fin.head()

Checking to see how our different methods performed on the dataset.

In [None]:
actual_result = df_sample['is_5g']

In [None]:
df_fin = pd.concat([df_fin, actual_result], axis=1)

In [None]:
df_fin.head()

Now, I will count the correct and incorrect predictions of each model. Then I will look at the success rate of the models.

In [None]:
true_lr = 0
false_lr = 0
true_rf = 0
false_rf = 0
true_dt = 0
false_dt = 0
true_xgb = 0
false_xgb = 0

In [None]:
for i in range(len(df_fin)):  # visit every row, including the last one
    if test_dt[i] == actual_result[i]:
        true_dt += 1
    else:
        false_dt += 1
print('Prediction Results for Decision Tree')
print('Correct Predictions: {:d} - False Predictions: {:d}'.format(true_dt, false_dt))

In [None]:
for i in range(len(df_fin)):
    if test_xgb[i] == actual_result[i]:
        true_xgb += 1
    else:
        false_xgb += 1
print('Prediction Results for XGBoost')
print('Correct Predictions: {:d} - False Predictions: {:d}'.format(true_xgb, false_xgb))

In [None]:
for i in range(len(df_fin)):
    if test_lr[i] == actual_result[i]:
        true_lr += 1
    else:
        false_lr += 1
print('Prediction Results for Logistic Regression')
print('Correct Predictions: {:d} - False Predictions: {:d}'.format(true_lr, false_lr))

In [None]:
for i in range(len(df_fin)):
    if test_rf[i] == actual_result[i]:
        true_rf += 1
    else:
        false_rf += 1
print('Prediction Results for Random Forest')
print('Correct Predictions: {:d} - False Predictions: {:d}'.format(true_rf, false_rf))

I have checked the test predictions against the sample data outputs. All of the models appear to have performed well.
Now we will look at them on a graph.

In [None]:
# Accuracy in percent: correct predictions divided by the number of rows.
acc_lr = true_lr / len(df_fin) * 100
acc_rf = true_rf / len(df_fin) * 100
acc_dt = true_dt / len(df_fin) * 100
acc_xgb = true_xgb / len(df_fin) * 100
acc_tot = acc_lr, acc_rf, acc_dt, acc_xgb
labels = 'Logistic Regression', 'Random Forest', 'Decision Tree', 'XGBoost'

In [None]:
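# The models are unordered categories, so bars read more clearly than a connected line.
plt.bar(labels, acc_tot)
plt.ylabel('Accuracy percentage')
plt.xlabel('Machine Learning methods')
plt.title('5G prediction accuracy by model')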

**It turns out I have achieved a 100% success rate with logistic regression.**
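The success rates compared here are plain accuracies: the share of rows where a model's prediction equals the reference label. With the predictions and the reference in one table, they can be computed in a single step. This sketch uses an invented toy table whose column names mirror `df_fin`; the values are not real results.

```python
import pandas as pd

# Toy results table, invented for illustration only.
results = pd.DataFrame({
    "5G-LR": [1, 0, 1, 0],
    "5G-DT": [1, 1, 1, 0],
    "is_5g": [1, 0, 1, 1],
})

model_columns = ["5G-LR", "5G-DT"]
# Compare each prediction column with the reference column row by row,
# then average the matches to get accuracy in percent.
accuracy_pct = results[model_columns].eq(results["is_5g"], axis=0).mean() * 100
print(accuracy_pct)
```

Because every row is included in the comparison, no manual adjustment of the counts is needed.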

# **We have come to the end of our notebook. I hope you found it informative and fun. Thank you!**

<img src="https://media.giphy.com/media/xUPOqo6E1XvWXwlCyQ/giphy.gif">