## Social Network Ads - Logistic Regression

### Social Network Advertisement 

In [None]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing Pandas and NumPy
import pandas as pd, numpy as np

In [None]:
# Loading the dataset
social_ads = pd.read_csv("/kaggle/input/social-network-ads/Social_Network_Ads.csv")
social_ads.head()

In [None]:
social_ads.Purchased.value_counts()

In [None]:
social_ads.shape

In [None]:
# let's look at the statistical aspects of the dataframe
social_ads.describe()

In [None]:
# Let's see the type of each column
social_ads.info()

### Step 3: Data Preparation

In [None]:
social_ads.dtypes

#### Converting the binary variable `Gender` (Male/Female) to 1/0

In [None]:
# List of variables to map

varlist =  ['Gender']

# Defining the map function
def binary_map(x):
    return x.map({"Male": 1, "Female": 0})

# Applying the function to the columns in varlist
social_ads[varlist] = social_ads[varlist].apply(binary_map)

In [None]:
social_ads.head(2)

In [None]:
social_ads.dtypes

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
# Apply the scaler to the non-binary numeric columns; the 0/1 columns (Gender, Purchased) are left as they are
num_vars = ["User ID","Age","EstimatedSalary"]

social_ads[num_vars] = scaler.fit_transform(social_ads[num_vars])

social_ads.head()

### Checking for null values

In [None]:
social_ads.isnull().sum()

In [None]:
# Checking the percentage of missing values
round(100*(social_ads.isnull().sum()/len(social_ads.index)), 2)

In the dataset above, the min-max scaler has rescaled `User ID`, `Age` and `EstimatedSalary` so that every value lies between 0 and 1.
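As a rough illustration of what min-max scaling does to a single column, the sketch below applies the formula `(x - min) / (max - min)` in plain Python. The helper name `min_max_scale` and the toy ages are assumptions made for this example, not values from the dataset.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant column maps to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [18, 30, 46, 60]          # hypothetical ages
scaled_ages = min_max_scale(ages)
print(scaled_ages)               # smallest age becomes 0.0, largest becomes 1.0
```

Because the minimum and maximum come from the data itself, a scaler fitted on the full dataset has already seen the rows that later end up in the test set.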

### Checking for Outliers

In [None]:
# Selecting the numeric columns to check for outliers
num_social_ads = social_ads[["User ID","Gender","Age","EstimatedSalary","Purchased"]]

In [None]:
# Checking outliers at 25%, 50%, 75%, 90%, 95% and 99%
num_social_ads.describe(percentiles=[.25, .5, .75, .90, .95, .99])

### Step 4: Test-Train Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Putting feature variables to X (the response 'Purchased' is excluded)
X = social_ads.drop(columns=['Purchased'])
X.head()

In [None]:
# Putting response variable to y
y = social_ads['Purchased']

In [None]:
y.head()
# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)
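To make the 70/30 split concrete, here is a minimal sketch in plain Python of the idea behind a shuffled split: shuffle the row indices with a fixed seed, then cut the list at 70%. The helper name `shuffled_split` and the ten toy rows are assumptions for illustration. `train_test_split` uses its own random number generator, so it will not pick the same rows as this sketch.

```python
import random

def shuffled_split(n_rows, train_fraction, seed):
    """Return (train_indices, test_indices) for a shuffled split of n_rows rows."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed gives a reproducible order
    cut = round(n_rows * train_fraction)      # number of rows in the training set
    return indices[:cut], indices[cut:]

train_idx, test_idx = shuffled_split(n_rows=10, train_fraction=0.7, seed=100)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, and rerunning with the same seed reproduces the same split.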

In [None]:
# Percentage of users who purchased
purchase_rate = (sum(social_ads['Purchased'])/len(social_ads['Purchased'].index))*100
purchase_rate

## Building our model

We use the **`LinearRegression` estimator from scikit-learn** because it works with RFE, scikit-learn's recursive feature elimination utility, which needs an estimator that exposes coefficients or feature importances.

### RFE
Recursive feature elimination
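The idea behind RFE is to fit a model, drop the weakest feature, and repeat until the requested number of features is left. The sketch below is a simplified NumPy version of that loop, assuming ordinary least squares as the model and the absolute coefficient as the importance score. The function name `rfe_linear` and the synthetic data are made up for illustration, and this is not scikit-learn's implementation.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Basic RFE loop: return (kept column indices, ranking per column)."""
    n_features = X.shape[1]
    remaining = list(range(n_features))
    ranking = np.ones(n_features, dtype=int)   # 1 = still selected
    while len(remaining) > n_keep:
        # Fit least squares with an intercept on the remaining columns.
        A = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coef = np.linalg.lstsq(A, y, rcond=None)[0][1:]   # skip the intercept
        # Eliminate the feature with the smallest absolute coefficient.
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        remaining.remove(weakest)
        # Every eliminated feature moves one rank further down.
        eliminated = [i for i in range(n_features) if i not in remaining]
        ranking[eliminated] += 1
    return remaining, ranking

# Synthetic data: y depends strongly on column 0, weakly on column 2, not on column 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 2]

kept, ranking = rfe_linear(X_demo, y_demo, n_keep=1)
print(kept, ranking.tolist())
```

Comparing coefficient sizes is only meaningful when the features are on similar scales, which is one reason to scale the columns before running RFE.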

In [None]:
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

In [None]:
# Running RFE with the number of features to select equal to 5
# (if X_train has 5 columns or fewer, RFE keeps every feature)
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, n_features_to_select=5)             # running RFE
rfe = rfe.fit(X_train, y_train)

In [None]:
rfe.support_

In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
col = X_train.columns[rfe.support_]
col

In [None]:
X_train.columns[~rfe.support_]

### Step 6: Looking at Correlations

In [None]:
# Importing matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# Let's see the correlation matrix 
plt.figure(figsize = (8,5))        # Size of the figure
sns.heatmap(social_ads.corr(),annot = True)
plt.show()
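Each cell of the heatmap is a Pearson correlation coefficient between two columns, which is the default method of `DataFrame.corr()`. As a rough sketch of what that number measures, the function below computes it by hand for short lists. The function name `pearson` and the toy values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

ages = [20, 30, 40, 50, 60]          # hypothetical values
salaries = [25, 35, 45, 55, 65]      # rises in lockstep with age
purchased = [0, 0, 1, 1, 1]          # older users purchased in this toy data

r_lockstep = pearson(ages, salaries)
r_purchase = pearson(ages, purchased)
print(r_lockstep, r_purchase)
```

A value near 1 or -1 means the two columns move together almost linearly, while a value near 0 means there is little linear relationship.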

### Step 7: Model Building
The data was already split into training and test sets in Step 4, so we can fit the first model directly on the training set.

#### Running Your First Training Model

In [None]:
import statsmodels.api as sm

In [None]:
# Logistic regression model: a GLM with a binomial family (logit link by default)
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1.fit().summary()
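A logistic regression model turns a weighted sum of the features into a probability through the logistic (sigmoid) function. The sketch below shows that step in plain Python. The intercept and coefficients are hypothetical numbers chosen for illustration, not values taken from the fitted summary, and the 0.5 cut-off is an assumption.

```python
import math

def predict_probability(intercept, coefs, features):
    """Probability of the positive class under a logistic regression model."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two scaled features (Age, EstimatedSalary).
intercept = -4.0
coefs = [3.0, 2.0]
user = [0.8, 0.9]                    # one user's scaled feature values

p = predict_probability(intercept, coefs, user)
label = int(p >= 0.5)                # assumed cut-off of 0.5
print(round(p, 3), label)
```

With coefficients read from a fitted model, the same calculation gives the predicted purchase probability for each user in the test set.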