# Sales Prediction based on Social Media Ads

## Himanshu Sharma (DSOCT03)

## Introduction
> Using the fictional dataset of Gender, Age, Salary, Purchased (Target variable), the company wants to know whether a customer will buy its product or not.

## Approach - Supervised Machine Learning (Classification)

### Supervised Machine Learning:

- Algorithms that learn from labeled data. After training, the algorithm assigns labels to new, unlabeled data by matching it against the patterns it has learned.

- It can be divided into two categories: classification and regression. 

### Classification:

- Classification is a technique for determining the class of an observation based on one or more independent variables.

- This notebook goes through the following classification algorithms:

> k-nearest neighbors

> Logistic Regression

> Support Vector Machine (SVM)

> Decision Tree

> Random Forest
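
As a minimal sketch of what "learning from labeled data" looks like in code, the example below fits a classifier on a handful of made-up rows and labels two unseen rows. The toy values and the names `X_toy`, `y_toy`, and `toy_clf` are assumptions for illustration only; they are not taken from the dataset used in this notebook.

```python
# Illustrative sketch only: toy data, not the Social_Network_Ads dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two hypothetical features (age, salary in thousands) and a binary label.
X_toy = np.array([[22, 20], [25, 30], [30, 35], [45, 90], [50, 110], [55, 120]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Learn from the labeled examples.
toy_clf = KNeighborsClassifier(n_neighbors=3)
toy_clf.fit(X_toy, y_toy)

# Assign labels to rows the model has not seen before.
new_customers = np.array([[24, 25], [52, 100]])
toy_pred = toy_clf.predict(new_customers)
print(toy_pred)
```

Every classifier used later follows the same `fit` then `predict` pattern, so the models can be swapped without changing the surrounding code.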

## Steps
- Data wrangling, which consists of:
    - Gathering data
    - Assessing data
    - Cleaning data
- Storing, analyzing, and visualizing our wrangled data
- Reporting on:
    - Data wrangling efforts
    - Data analyses and visualizations

In [None]:
#import Useful Library
import pandas as pd
import numpy as np

#for making graph
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#for warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('../input/social-network-ads/Social_Network_Ads.csv')
df.head()

In [None]:
df.info()

In [None]:
df.Purchased.value_counts()

In [None]:
df.Gender.value_counts()

## 1. Perform Basic EDA
### a. Boxplot

In [None]:
sns.boxplot(y='Age', x='Purchased', data=df)

In [None]:
sns.boxplot(y='EstimatedSalary', x='Purchased', data=df)

### b. Histogram – Distribution of Target Variable

In [None]:
plt.hist(x="Purchased", data=df);
plt.title('Distribution of Purchase');
plt.ylabel('Count');
plt.xlabel('Purchase');

- This histogram makes it clear that most users who saw the social media ads did not purchase the product. In this notebook we will dive deeper into the data to look for the reasons.

In [None]:
plt.figure(figsize=(20,5))
bins_size = np.arange(15000,150000+10000,1000)
plt.hist(x="EstimatedSalary", data=df, bins= bins_size,rwidth=0.9);
plt.title('Distribution of Salary');
plt.ylabel('Count');
plt.xlabel('Salary');

- The chart shows the distribution across income levels. Interestingly, customers with an estimated salary between roughly 75k and 80k US dollars occur with very similar frequency. The average salary of the customers is 69742.5.

In [None]:
df.EstimatedSalary.mean()

In [None]:
plt.figure(figsize=(15,5))
bins_size = np.arange(18,65,2)
plt.hist(x="Age", data=df, bins= bins_size,rwidth=0.9);
plt.title('Distribution of Age');
plt.ylabel('Count');
plt.xlabel('Age');

- The graph above shows that ages from 27 to 42 are the most frequent, but there is no clear overall pattern. We can only find group-wise patterns, such as older age groups being less frequent than younger ones.

### c. Distribution Plot – Target Variable

In [None]:
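# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(df.Purchased, kde=True);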

### d. Aggregation for all numerical Columns

In [None]:
df.describe()

### e. Unique Values across all columns

In [None]:
df.Age.unique()

In [None]:
df.Age.nunique()

In [None]:
df.EstimatedSalary.unique()

In [None]:
df.EstimatedSalary.nunique()

In [None]:
df.Purchased.unique()

### f. Duplicate values across all columns

In [None]:
df.duplicated().sum()

### g. Correlation – Heatmap

In [None]:
plt.figure(figsize =(8,8))
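# numeric_only=True skips the string column Gender, which newer pandas versions refuse to correlate
ax = sns.heatmap(df.corr(numeric_only=True), square=True, annot=True, cmap='Spectral')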
ax.set_ylim(4.0, 0)

- The graph above shows the linear correlation between the attributes of the Social Network Ads dataset. With the `Spectral` colormap, the most strongly correlated features appear in blue and the least correlated in red. We can see that, among the features, only Age shows a clear linear relationship with Purchased.
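
The same information can be read as numbers rather than colors. The sketch below ranks features by their Pearson correlation with the target; the toy frame `toy` and its values are invented for illustration and only reuse the column names of this dataset.

```python
# Illustrative sketch: rank features by linear correlation with the target.
import pandas as pd

# Hypothetical values; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [20, 25, 30, 35, 40, 45, 50, 55],
    "EstimatedSalary": [40, 90, 30, 80, 50, 100, 60, 70],
    "Purchased": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Take the target's column of the correlation matrix and drop its self-correlation.
target_corr = (
    toy.corr()["Purchased"]
    .drop("Purchased")
    .sort_values(ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship with the target.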

### h. Regression Plot

In [None]:
ax = sns.regplot(x="EstimatedSalary", y="Purchased", data=df)

In [None]:
ax = sns.regplot(x="Age", y="Purchased", data=df)

### i. Bar Plot

In [None]:
col = sns.color_palette()[0]

In [None]:
sns.barplot(x="Purchased", y="EstimatedSalary", data=df, color=col)

- This bar plot suggests that people who purchased the product have a higher-than-average salary.

In [None]:
sns.barplot(x="Purchased", y="EstimatedSalary",hue='Gender', data=df)

- Females have a higher average salary.
- Females purchased more.

### j. Pair plot

In [None]:
sns.pairplot(df, vars=["Age", "EstimatedSalary","Purchased"])

In [None]:
sns.pairplot(df, vars=["Age", "EstimatedSalary","Purchased"], hue = "Gender")

## 2. Drop all duplicate rows

In [None]:
df.drop_duplicates(inplace=True)

## 3. Drop all non-essential features

In [None]:
df.drop(columns=['User ID'], inplace = True)

## 4. Replace outliers with Nulls (if you find it essential) and replace all the nulls with respective approach of central tendencies (Mean/Median/Mode).

In [None]:
# Mark non-buyers older than 58 as outliers, then impute only the Age column
df.loc[(df.Age > 58) & (df.Purchased == 0), 'Age'] = np.nan
df['Age'] = df['Age'].fillna(53)

In [None]:
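# Mark non-buyers earning above 120000 as outliers, then impute only the salary column
df.loc[(df.EstimatedSalary > 120000) & (df.Purchased == 0), 'EstimatedSalary'] = np.nan
df['EstimatedSalary'] = df['EstimatedSalary'].fillna(120000)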

## 5. Calculate Z score to validate whether outliers are still present or not.

In [None]:
from scipy import stats
# Absolute z-scores; rows above the threshold are treated as outliers
z = np.abs(stats.zscore(df['EstimatedSalary']))
threshold = 3
print(np.where(z > threshold))

In [None]:
z = np.abs(stats.zscore(df['Age']))
print(np.where(z > 3))
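
For reference, the z-score of a value is its distance from the mean measured in standard deviations, z = (x - mean) / std. The sketch below computes it by hand with NumPy on invented numbers (the array `values` is an assumption, not data from this notebook) and flags anything beyond 3.

```python
# Illustrative sketch: z-score outlier check written out by hand.
import numpy as np

# Hypothetical measurements with one obviously extreme value at the end.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0,
                   10.0, 12.0, 11.0, 13.0, 12.0, 60.0])

# Population standard deviation (ddof=0), the default used by scipy.stats.zscore.
z_scores = np.abs((values - values.mean()) / values.std())

# Positions whose absolute z-score exceeds the usual cutoff of 3.
outlier_idx = np.where(z_scores > 3)[0]
print(outlier_idx)
```

An empty result from `np.where` therefore means no value lies more than three standard deviations from the mean.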

## 6. Clean the data with formatting issues if any. (converting datatypes, replacing dollars, etc.)

In [None]:
df.Age = df.Age.astype("int64")

In [None]:
df.EstimatedSalary = df.EstimatedSalary.astype("int64")

In [None]:
df.info()

## 7. Add your view of EDA to enhance understanding of data. i.e., Grouping data and observing the way data is distributed. Try to add as many layers of EDA as possible.

In [None]:
a = df.groupby(['Gender', 'Age'])
a.first()

In [None]:
a = df.groupby(['Purchased','EstimatedSalary'])
a.first()

In [None]:
sns.scatterplot(y="EstimatedSalary", x="Purchased", data=df)

In [None]:
plt.figure(figsize = (15,8))
sns.scatterplot(y="EstimatedSalary", x="Age", data=df, hue = 'Purchased')

In [None]:
sns.catplot(y="EstimatedSalary", x="Purchased", data=df, hue = 'Gender')

In [None]:
sns.catplot(y="Age", x="Purchased", data=df, hue = 'Gender')

## 8. Build a model of choice – Classification problem statement, hence build a classification model first and calculate Confusion Matrix, AUC, F1 Score, Precision, Recall and Accuracy.

In [None]:
# Encode Gender as a number (Male = 1, Female = 0); assigning back avoids chained in-place edits
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})

In [None]:
df.head()

## Standardize the Variables
- Because the KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it, the scale of the variables matters. Any variables that are on a large scale will have a much larger effect on the distance between the observations, and hence on the KNN classifier, than variables that are on a small scale.
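
A small sketch can make the scale effect concrete. The three customers below are invented (the array `people` is an assumption for illustration): one differs from the first mainly in age, the other mainly in salary. Before scaling, salary dominates the Euclidean distance; after standardizing, both differences carry similar weight.

```python
# Illustrative sketch: how feature scale changes Euclidean distances.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: columns are age (years) and salary (dollars).
people = np.array([[25.0, 50000.0],
                   [60.0, 51000.0],   # much older, similar salary
                   [26.0, 90000.0]])  # similar age, much higher salary

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_d01 = dist(people[0], people[1])
raw_d02 = dist(people[0], people[2])

# Standardize each column to zero mean and unit variance, then measure again.
scaled = StandardScaler().fit_transform(people)
scaled_d01 = dist(scaled[0], scaled[1])
scaled_d02 = dist(scaled[0], scaled[2])

print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```

On the raw values the 35-year age gap barely registers next to the salary gap, which is why the features are scaled before KNN is fitted.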

In [None]:
X = df.iloc[:, [1, 2]]
y = df.iloc[:, 3]

In [None]:
X

## KNN

### Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### Using KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
knn_pred = knn.predict(X_test)

### Predictions and Evaluations

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,knn_pred))
print(classification_report(y_test,knn_pred))
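
The numbers in the classification report all derive from the four cells of the confusion matrix. The sketch below works them out by hand on invented labels (`y_true_toy` and `y_pred_toy` are assumptions for illustration) and can be compared against scikit-learn's own metric functions.

```python
# Illustrative sketch: precision, recall, F1 and accuracy from a confusion matrix.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions.
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)            # of the predicted buyers, how many bought
recall = tp / (tp + fn)               # of the actual buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)
print(precision, recall, f1, accuracy)
```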

### Choosing a K Value

In [None]:
error_rate = []

for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

Here we can see that the error rate drops once K is above about 4. Going much further risks overfitting, so we take k = 4. Let's retrain the model with that value and check the classification report!
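
One alternative, sketched here under stated assumptions, is to choose K by cross-validation on the training data so that the test set is not used for selection. The data comes from `make_classification` and the names `X_demo`, `y_demo`, `cv_error`, and `best_k` are hypothetical; this is not the procedure used in the cells above.

```python
# Illustrative sketch: pick K by cross-validated error instead of test-set error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature binary problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

k_values = list(range(1, 21))
cv_error = []
for k in k_values:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_error.append(1 - scores.mean())

# K with the lowest mean cross-validated error.
best_k = k_values[int(np.argmin(cv_error))]
print("best K:", best_k)
```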

In [None]:
knn = KNeighborsClassifier(n_neighbors=4)

knn.fit(X_train,y_train)
knn_pred = knn.predict(X_test)

print('WITH K=4')
print('\n')
print(confusion_matrix(y_test,knn_pred))
print('\n')
print(classification_report(y_test,knn_pred))

In [None]:
from sklearn.metrics import accuracy_score
print ('accuracy_score : ', accuracy_score(y_test,knn_pred))

## K fold Cross Validation

> This technique is useful for evaluating bias and variance more accurately. It splits the training set into k groups, and in each iteration a different fold (individual section) is held out for testing. This allows every part of the training set to be used for testing.
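
To make the splitting concrete, the sketch below runs a 5-fold split over ten placeholder samples and records which indices are held out in each fold. The array `samples` is an assumption for illustration. Note that `cross_val_score` with an integer `cv` and a classifier uses stratified folds by default, so the real splits also preserve class proportions.

```python
# Illustrative sketch: how 5-fold cross-validation partitions the samples.
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10)  # ten placeholder samples, identified by index
kf = KFold(n_splits=5, shuffle=True, random_state=0)

tested = []
for fold, (train_idx, test_idx) in enumerate(kf.split(samples), start=1):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    tested.extend(test_idx.tolist())

# Every sample is held out exactly once across the five folds.
print(sorted(tested))
```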

In [None]:
from sklearn.model_selection import cross_val_score
knn_accuracy = cross_val_score(knn,X,y, cv = 5)

In [None]:
knn_accuracy

In [None]:
knn_accuracy.mean()

In [None]:
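from sklearn.metrics import roc_curve, auc

# Use predicted probabilities of the positive class so the ROC curve covers many thresholds
knn_prob = knn.predict_proba(X_test)[:, 1]
knn_fpr, knn_tpr, thresholds = roc_curve(y_test, knn_prob)
auc_knn = auc(knn_fpr, knn_tpr)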

plt.figure(figsize=(5, 5), dpi=100)
plt.plot(knn_fpr, knn_tpr, marker='.', label='Knn (auc = %0.3f)' % auc_knn)

plt.xlabel('False Positive Rate -->')
plt.ylabel('True Positive Rate -->')

plt.legend()

plt.show()
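
AUC depends on whether the curve is built from continuous scores or from hard 0/1 predictions. The sketch below compares the two on invented values (`y_true_auc` and `scores_auc` are assumptions for illustration): thresholding the scores at 0.5 first throws away ranking information and changes the result.

```python
# Illustrative sketch: AUC from scores versus AUC from hard labels.
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class.
y_true_auc = [0, 0, 1, 1, 0, 1]
scores_auc = [0.1, 0.6, 0.55, 0.9, 0.3, 0.7]

# AUC using the full ranking given by the scores.
auc_from_scores = roc_auc_score(y_true_auc, scores_auc)

# AUC after collapsing the scores to 0/1 predictions at a 0.5 cutoff.
hard_labels = [int(s >= 0.5) for s in scores_auc]
auc_from_labels = roc_auc_score(y_true_auc, hard_labels)

print(auc_from_scores, auc_from_labels)
```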

## 9. Build at least a minimum of 4 different Classification models. All the models should use K-Fold cross Validation to train the model with at least 5-fold cross validation.

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train)
log_pred = log_reg.predict(X_test)

In [None]:
print(confusion_matrix(y_test,log_pred))
print('\n')
print(classification_report(y_test,log_pred))

## K-Fold CV

In [None]:
log_accuracy = cross_val_score(log_reg,X,y, cv = 5)
print(log_accuracy)
print("mean value of accuracy",log_accuracy.mean())

## SVM

In [None]:
from sklearn.svm import SVC
svc_classifier = SVC(kernel = 'rbf', random_state = 0)
svc_classifier.fit(X_train, y_train)
svc_pred = svc_classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test,svc_pred))
print('\n')
print(classification_report(y_test,svc_pred))

## K-Fold CV

In [None]:
svc_accuracy = cross_val_score(svc_classifier,X,y, cv = 5)
print(svc_accuracy)
print("mean value of accuracy",svc_accuracy.mean())

## Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
dt_classifier.fit(X_train, y_train)
dt_pred = dt_classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test,dt_pred))
print('\n')
print(classification_report(y_test,dt_pred))

## K-Fold CV

In [None]:
dt_accuracy = cross_val_score(dt_classifier,X,y, cv = 5)
print(dt_accuracy)
print("mean value of accuracy",dt_accuracy.mean())

## Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)
rf_pred = rf_classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test,rf_pred))
print('\n')
print(classification_report(y_test,rf_pred))

## K-Fold CV

In [None]:
rf_accuracy = cross_val_score(rf_classifier,X,y, cv = 5)
print(rf_accuracy)
print("mean value of accuracy",rf_accuracy.mean())

## 10. Compare the error and pick the ideal one with least errors.

> Judging by the confusion matrix and accuracy score, the `Random Forest classifier` performs better than the others.

In [None]:
print("For Random Forest Classifier::")
print(confusion_matrix(y_test,rf_pred))
print('\n')
print(classification_report(y_test,rf_pred))
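
A comparison like this can also be made side by side in one loop. The sketch below is a hedged illustration on synthetic data from `make_classification`; the dictionary `candidates`, the settings of each model, and the names `cv_means` and `best_name` are assumptions, and the resulting ranking says nothing about the results on the real dataset.

```python
# Illustrative sketch: compare several classifiers by mean 5-fold accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-feature binary problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
candidates = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean cross-validated accuracy per model.
cv_means = {name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
            for name, model in candidates.items()}

for name, score in sorted(cv_means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {score:.3f}")

best_name = max(cv_means, key=cv_means.get)
```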

## 11. Run hyperparameter tuning on all the models and pick the best parameters (A minimum of 2 Parameters should be tuned) and picked.

## SVM

In [None]:
#Applying grid search
from sklearn.model_selection import GridSearchCV
parameters = [{"C": [1, 10, 100, 1000], "kernel": ['linear']}, 
              {"C": [1, 10, 100, 1000], "kernel": ['rbf'], 'gamma': [0.5, 0.1, 0.01, 0.001]}]

#Use this list to train
grid_search = GridSearchCV(estimator = svc_classifier, param_grid = parameters, scoring = 'accuracy', cv = 10, n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)

#Use attributes of grid_search to get the results
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

print("Best accuracy: ",best_accuracy)
print(best_parameters)

In [None]:
from sklearn.svm import SVC
svc_classifier = SVC(kernel = 'rbf', random_state = 0, C =10, gamma=0.1)
svc_classifier.fit(X_train, y_train)
svc_pred = svc_classifier.predict(X_test)

In [None]:
print("For SVM Classifier::")
print(confusion_matrix(y_test,svc_pred))
print('\n')
print(classification_report(y_test,svc_pred))

## Random Forest Classification

In [None]:
n_estimators = [100, 300, 500]
max_depth = [5, 8, 15]
min_samples_leaf = [1, 2] 

hyperF = dict(n_estimators = n_estimators, max_depth = max_depth,min_samples_leaf = min_samples_leaf)

gridF = GridSearchCV(rf_classifier, hyperF, cv = 5, verbose = 1, 
                      n_jobs = -1)
bestF = gridF.fit(X_train, y_train)

In [None]:
bestF.best_estimator_

In [None]:
bestF.best_params_

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0, max_depth=5,min_samples_leaf=1)
rf_classifier.fit(X_train, y_train)
rf_pred = rf_classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test,rf_pred))
print('\n')
print(classification_report(y_test,rf_pred))

## Decision Tree Classifier

In [None]:
criterion = ['gini', 'entropy']
max_depth = [4,6,8,12]

parameters = dict(criterion=criterion,max_depth=max_depth)

  
# Search over the decision tree with its own parameter grid (not the random forest's)
clf = GridSearchCV(dt_classifier, parameters, cv = 5, verbose = 1, n_jobs = -1)

# Fit the grid search
clf.fit(X_train, y_train)

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_classifier = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=5)
dt_classifier.fit(X_train, y_train)
dt_pred = dt_classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test,dt_pred))
print('\n')
print(classification_report(y_test,dt_pred))

## Logistic Regression

In [None]:
param_grid = [    
    {'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
    'max_iter' : [100, 1000,2500, 5000]
    }
]
clf = GridSearchCV(log_reg, param_grid = param_grid, cv = 3, verbose=True, n_jobs=-1)
best_clf = clf.fit(X_train, y_train)

In [None]:
a = best_clf.best_estimator_
a

In [None]:
best_clf.best_params_

In [None]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=1.0, solver='lbfgs',max_iter=100 )
log_reg.fit(X_train,y_train)
log_pred = log_reg.predict(X_test)

In [None]:
print(confusion_matrix(y_test,log_pred))
print('\n')
print(classification_report(y_test,log_pred))

## KNN

In [None]:
#List Hyperparameters that we want to tune.
leaf_size = list(range(1,10))
n_neighbors = list(range(1,10))
p=[1,2]

#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)

#Use GridSearch
clf = GridSearchCV(knn, hyperparameters, cv=10, verbose=True, n_jobs=-1)
#Fit the model
best_model = clf.fit(X_train,y_train)

In [None]:
best_model.best_estimator_

In [None]:
best_model.best_params_

In [None]:
knn = KNeighborsClassifier(n_neighbors=9, p = 1, leaf_size=1)

knn.fit(X_train,y_train)
knn_pred = knn.predict(X_test)

print(confusion_matrix(y_test,knn_pred))
print('\n')
print(classification_report(y_test,knn_pred))
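
Grid search and feature scaling can be combined in a single pipeline, so that scaling is refit inside every cross-validation fold. The sketch below shows one way to do this for KNN on synthetic data; the step names `scale` and `knn`, the grid values, and the variables `X_tr`, `X_te`, `search`, and `tuned_params` are assumptions for illustration rather than the settings used above.

```python
# Illustrative sketch: tune KNN hyperparameters with scaling inside the pipeline.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature binary problem standing in for the real data.
X_demo, y_demo = make_classification(n_samples=300, n_features=2, n_informative=2,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid_demo = {"knn__n_neighbors": [3, 5, 7, 9], "knn__p": [1, 2]}

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

tuned_params = search.best_params_
test_acc = search.score(X_te, y_te)  # accuracy of the refit best model on held-out data
print(tuned_params, test_acc)
```

Because `GridSearchCV` refits the best configuration on the full training split by default, `search` can be used directly for prediction afterwards.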

## 12. Now, compare the models and pick the ideal one.

### After hyperparameter tuning, the `Decision Tree` performs best among the models above

In [None]:
print(confusion_matrix(y_test,dt_pred))
print('\n')
print(classification_report(y_test,dt_pred))

## 13. Try to Predict the target with maximum independent features.

In [None]:
df.head()

In [None]:
X1 = df.iloc[:,0: 3]
y1 = df.iloc[:, 3]

In [None]:
X1

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size = 0.30, random_state = 0)

## Decision Tree

In [None]:
dt_classifier = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=5)
dt_classifier.fit(X_train, y_train)
dt_pred = dt_classifier.predict(X_test)

In [None]:
print(confusion_matrix(y_test,dt_pred))
print('\n')
print(classification_report(y_test,dt_pred))