## Hands-on Assignment Python

### Objective:
The shared data file summarizes a set of features about articles published by Mashable over a two-year period. The aim is to predict the popularity of each article from its number of shares and subsequently categorize it as a Popular or Unpopular news article.

URL to dataset:
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

Citation:
K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.

**NOTE: This dataset is one-hot encoded. Please merge the encoded columns into one column for the purpose of Exploratory Data Analysis.**

### 1. Analyse the characteristics of the dataset and understand the features (refer to the metadata file)

In [None]:
import pandas as pd

# Column names in this CSV carry a leading space (e.g. " shares"); they are used as-is below
path = "../input/online-news-popularity/OnlineNewsPopularity.csv"
df = pd.read_csv(path)

In [None]:
df.columns

In [None]:
df.head(5)

In [None]:
df[[ ' data_channel_is_lifestyle',
       ' data_channel_is_entertainment', ' data_channel_is_bus',
       ' data_channel_is_socmed', ' data_channel_is_tech',
       ' data_channel_is_world',]].head()

In [None]:
df[[       ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
       ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
       ' weekday_is_sunday', ' is_weekend']].head()

In [None]:
# Print the readable label that follows "is_" in each weekday column name
airs_on = [       ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
       ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
       ' weekday_is_sunday', ' is_weekend']
for day in airs_on:
    i = day.index("is_")
    print(day[i+3:])

In [None]:
chanel_type = [ ' data_channel_is_lifestyle',
       ' data_channel_is_entertainment', ' data_channel_is_bus',
       ' data_channel_is_socmed', ' data_channel_is_tech',
       ' data_channel_is_world',]
for chanel in chanel_type:
    i = chanel.index("is_")
    print(chanel[i+3:])

### 2. Perform Data Cleaning:
* Removing Duplicates
* Missing Data Analysis (if any)
* Imputing Missing Data / Dropping rows with missing data (up to your judgement)
* Please go ahead and do more analysis / cleaning if you deem necessary.





The code below checks for NA values. The total count is zero and no column contains any, so this kind of cleaning is not needed.

In [None]:
df.describe()

In [None]:
df.isna().sum().sum()

In [None]:
df[df.columns[df.isna().any()]]

The code below counts the duplicated rows. Again, there are none.

In [None]:
df.duplicated().sum()

In [None]:
df[df.duplicated()]

The code below confirms that the data has no null values (`isnull()` is an alias of `isna()`).

In [None]:
df.isnull().sum().sum()


In [None]:
def _first_hot_label(row, hot_columns):
    # Return the text after "is_" for the first indicator column set to 1
    for column in hot_columns:
        if round(row[column]) == 1:
            return column[column.index("is_") + 3:]
    return None  # no indicator set, e.g. an article with no channel flagged


def reverse_one_hot_chanel(row):
    hot_columns = [' data_channel_is_lifestyle',
       ' data_channel_is_entertainment', ' data_channel_is_bus',
       ' data_channel_is_socmed', ' data_channel_is_tech',
       ' data_channel_is_world',]
    return _first_hot_label(row, hot_columns)


def reverse_one_hot_weekday(row):
    # ' is_weekend' is left out: it is a summary flag, not a day of the week
    hot_columns = [' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
       ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
       ' weekday_is_sunday']
    return _first_hot_label(row, hot_columns)

In [None]:
df["chanel"]=df.apply(reverse_one_hot_chanel,axis=1)
df["weekday"] = df.apply(reverse_one_hot_weekday, axis =1)

In [None]:
df[["chanel","weekday"]].head()
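
The row-by-row `apply` above can be slow on a large frame. As a rough sketch of a vectorized alternative, the block below collapses a group of indicator columns with `idxmax`. The toy frame, the helper name `collapse_one_hot`, and the `"none"` fallback label are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy frame with hypothetical indicator columns (no leading spaces here)
toy = pd.DataFrame({
    "data_channel_is_tech":  [1, 0, 0],
    "data_channel_is_world": [0, 1, 0],
})


def collapse_one_hot(frame, columns, prefix, default="none"):
    """Return one label per row from a block of 0/1 indicator columns."""
    block = frame[columns]
    # idxmax picks the first column holding the row maximum (the 1)
    labels = block.idxmax(axis=1).str.replace(prefix, "", regex=False)
    # Rows with no indicator set would otherwise inherit the first column's label
    return labels.where(block.sum(axis=1) > 0, default)


channel = collapse_one_hot(toy, list(toy.columns), "data_channel_is_")
print(channel.tolist())
```

The explicit fallback matters because `idxmax` on an all-zero row still returns the first column name.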

### 3. Perform Exploratory Data Analysis: 
* Outlier Analysis / Removal
* Correlation Plot
* Check for Class Imbalance
* Univariate Analysis / Bi-variate Analysis with Target Variable (convert shares to a binary variable: if shares > 1400, the news is considered popular)
* Please go ahead and do more analysis if you deem necessary.
* Please share a concise note of your findings from the EDA. (One you would share with your client.)
* Which models would you opt for classification and why?



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
target = df[" shares"]

This snippet counts how many articles are popular (more than 1400 shares) and how many are unpopular, before any outliers are removed.

In [None]:
print(len(target[target>1400]))
print(len(target)-len(target[target>1400]))
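
For the class-imbalance check it can help to look at proportions as well as raw counts. The following is a minimal sketch on made-up share counts (the `shares` series is a hypothetical stand-in for `df[" shares"]`), using the assignment's rule that an article is popular when shares > 1400.

```python
import pandas as pd

# Hypothetical share counts standing in for df[" shares"]
shares = pd.Series([500, 1400, 1401, 3000, 900, 25000])

# Label follows the assignment's rule: popular when shares > 1400
is_popular = (shares > 1400).astype(int)

# Absolute and relative class sizes, ordered by class label
counts = is_popular.value_counts().sort_index()
fractions = is_popular.value_counts(normalize=True).sort_index()
print(counts.to_dict())
print(fractions.round(2).to_dict())
```

Note that a value of exactly 1400 falls in the unpopular class because the comparison is strict.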

The box plot below shows practically nothing but outliers. To clear them, we will keep only the values that lie within the whiskers, i.e. within 1.5 times the IQR of the first and third quartiles.

In [None]:
plt.figure(figsize=(12, 7))
plt.boxplot(target)
plt.show()

In [None]:

# finding the 1st quartile
q1 = np.quantile(target, 0.25)
 
# finding the 3rd quartile
q3 = np.quantile(target, 0.75)
med = np.median(target)
 
# finding the iqr region
iqr = q3-q1
 
# finding upper and lower whiskers
upper_bound = q3+(1.5*iqr)
lower_bound = q1-(1.5*iqr)

After removing several major outliers the graph is much easier to read, but a large number of outliers remain. We can run the same process again to narrow the data further.

In [None]:
# boxplot of data within the whisker
new_target = target[(target >= lower_bound) & (target <= upper_bound)]

plt.figure(figsize=(12, 7))
plt.boxplot(new_target)
plt.show()

In [None]:
q1 = np.quantile(new_target, 0.25)
 
# finding the 3rd quartile
q3 = np.quantile(new_target, 0.75)
med = np.median(new_target)
 
# finding the iqr region
iqr = q3-q1
 
# finding upper and lower whiskers
upper_bound = q3+(1.5*iqr)
lower_bound = q1-(1.5*iqr)
final_target = new_target[(new_target >= lower_bound) & (new_target <= upper_bound)]
plt.figure(figsize=(12, 7))
plt.boxplot(final_target)
plt.show()
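
Since the whisker computation is applied twice, it could be wrapped in a small helper. The sketch below assumes a helper named `iqr_filter` (hypothetical, not in the original notebook) and demonstrates it on a toy series rather than the real share counts.

```python
import numpy as np
import pandas as pd


def iqr_filter(values, k=1.5):
    """Keep the values inside [q1 - k*iqr, q3 + k*iqr]."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return values[(values >= q1 - k * iqr) & (values <= q3 + k * iqr)]


# Toy data with one obvious outlier
sample = pd.Series([1, 2, 3, 4, 5, 100])
once = iqr_filter(sample)
twice = iqr_filter(once)
print(once.tolist(), twice.tolist())
```

Because boolean masking on a Series keeps the original index labels, the surviving labels can later be used to select the matching rows of a DataFrame with `.loc`.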

The code snippet below shows the distribution of the final target. The majority of the data appears to fall near 1000 shares.

In [None]:
import seaborn as sns

sns.histplot(final_target)


In [None]:
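# Keep only the rows that survived both rounds of outlier removal
# (select by index label, not by share value)
target = target.loc[final_target.index]
df = df.loc[target.index]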

In [None]:
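# pairplot builds its own figure; x_vars is limited to numeric columns
# because url, chanel and weekday hold strings
numeric_columns = df.select_dtypes(include="number").columns.tolist()
sns.pairplot(df, y_vars=" shares", x_vars=numeric_columns)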

From the plots above we can see the relationship between the target (shares) and each of the other features.

In [None]:
# get correlations between the numeric features in the dataset
# (numeric_only requires pandas >= 1.5; url, chanel and weekday are strings)
corrmat = df.corr(numeric_only=True)
plt.figure(figsize=(100,100))
# plot heat map
sns.set(font_scale = 2)

g=sns.heatmap(corrmat,annot=True,cmap="RdYlGn")


The data was already clean, with no missing values or duplicates, which was a big plus.
We can see many positive and negative correlations in the pair plots and the correlation heat map, which should make feature selection fairly simple. To determine the popularity of an article we will use three different classifiers; the ones currently in mind are Logistic Regression, Perceptron, and Naive Bayes.

### 4. Feature Engineering / Selection:
Perform Feature Engineering based on conclusions from EDA and use relevant feature selection techniques for analysis.

In [None]:
# one is popular, zero is not
df["target"] = (df[" shares"] > 1400).astype(int)

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df["url"] = le.fit_transform(df["url"])
df["chanel"] = le.fit_transform(df["chanel"])
df["weekday"] = le.fit_transform(df["weekday"])

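# features are every column except the raw share count and the derived label
X = df[[c for c in df.columns if c != " shares"  and c!= "target"]]
# chi2 requires non-negative inputs; squaring makes every feature non-negative
X = X.apply(lambda x: x**2)
y = df["target"]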
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization 
feature_scores = pd.concat([dfcolumns,dfscores],axis=1)
feature_scores.columns = ['Feature','Score']  #naming the dataframe columns
print("Top 10 best features")
print(feature_scores.nlargest(10,'Score'))

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) #use inbuilt class feature_importances of tree based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='bar')


In [None]:
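# intersect the top 10 of each method; sorted() returns a list with a stable order,
# which pandas accepts for column selection
chi2_top = set(feature_scores.nlargest(10, 'Score')['Feature'])
tree_top = set(feat_importances.nlargest(10).index)
true_best_features = sorted(chi2_top & tree_top)
print(true_best_features)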

The snippets above use two separate measures to determine the top 10 features: the first uses scikit-learn's SelectKBest and the second uses the ExtraTreesClassifier. We then take the intersection of the two resulting sets to keep only the features that both methods rank highly.

### 5. Model Training:
* Split your dataset into Train and Test (80/20 split).
* Create a classification model to predict whether the news is popular or unpopular (if shares > 1400, the news is considered popular).
* Try at least 3 classification models and check for overfitting.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
# pandas needs a list (not a set) of column labels, hence sorted()
xTrain,xTest,yTrain,yTest =train_test_split(np.array(X[sorted(true_best_features)]), np.array(y), test_size = 0.2, random_state = 2)
display(xTrain.shape,xTest.shape,yTrain.shape,yTest.shape)

In [None]:
lr = LogisticRegression(random_state=0, max_iter =100000)
rf = RandomForestClassifier(max_features=5,n_estimators = 10)
nb = BernoulliNB()

In [None]:
import time
start = time.time()
lr.fit(xTrain,yTrain)
finish = time.time()- start
print(f"Logistic Regression took {finish} seconds to complete training")

In [None]:
start = time.time()
rf.fit(xTrain,yTrain)
finish = time.time()- start
print(f"Random Forest took {finish} seconds to complete training")

In [None]:
start = time.time()
nb.fit(xTrain,yTrain)
finish = time.time()- start
print(f"Naive Bayes took {finish} seconds to complete training")

In [None]:
lr_train_score= lr.score(xTrain,yTrain)
nb_train_score = nb.score(xTrain,yTrain)
rf_train_score = rf.score(xTrain,yTrain)

print(f"Logistic Regression Training Score {lr_train_score}")
print(f"Random Forest Training Score {rf_train_score}")
print(f"Naive Bayes Training Score {nb_train_score}")

In [None]:
lr_test_score= lr.score(xTest,yTest)
nb_test_score = nb.score(xTest,yTest)
rf_test_score = rf.score(xTest,yTest)

print(f"Logistic Regression Test Score {lr_test_score}")
print(f"Random Forest Test Score {rf_test_score}")
print(f"Naive Bayes Test Score {nb_test_score}")

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, xTrain, yTrain, cv=10)
print(f"Cross Validation Scores for Logistic Regression\n{scores}\n")
scores = cross_val_score(nb, xTrain, yTrain, cv=10)
print(f"Cross Validation Scores for Naive Bayes \n{scores}\n")
scores = cross_val_score(rf, xTrain, yTrain, cv=10)
print(f"Cross Validation Scores for Random Forest\n{scores}\n")


We can see from the code snippets above that the classifiers rank as follows: Random Forest, Naive Bayes, Logistic Regression.
However, this holds for only one random partition of the data. In the code below we train on 50 different random partitions of the data to check that the trend continues.
For each partition we fit the models to the training data and compare each model's test score with the best seen so far; at the end we keep the version of each model that performed best on the test data.

In [None]:
lr_best = (None,0)
rf_best =  (None,0)
nb_best = (None,0)
for i in range(50):
    # seed with i so that each iteration draws a different partition
    xTrain,xTest,yTrain,yTest =train_test_split(np.array(X[sorted(true_best_features)]), np.array(y), test_size = 0.2, random_state = i)
    lr = LogisticRegression(random_state=0, max_iter =100000)
    rf = RandomForestClassifier(max_features=5,n_estimators = 10)
    nb = BernoulliNB()
    lr.fit(xTrain,yTrain)
    rf.fit(xTrain,yTrain)
    nb.fit(xTrain,yTrain)
    lr_test_score= lr.score(xTest,yTest)
    nb_test_score = nb.score(xTest,yTest)
    rf_test_score = rf.score(xTest,yTest)
    if(lr_test_score>lr_best[1]):
        lr_best = (lr,lr_test_score)
        
    if(nb_test_score>nb_best[1]):
        nb_best = (nb,nb_test_score)
    
    if(rf_test_score>rf_best[1]):
        rf_best = (rf,rf_test_score)

In [None]:
lr, lr_top_score = lr_best
rf, rf_top_score = rf_best
nb, nb_top_score = nb_best
print(f"Logistic Regression best possible score {lr_top_score}")
print(f"Random Forest best possible score {rf_top_score}")
print(f"Naive Bayes best possible score {nb_top_score}")

Perform cross-validation again.

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, xTrain, yTrain, cv=10)
print(f"Cross Validation Scores for Logistic Regression\n{scores}\n")
scores = cross_val_score(nb, xTrain, yTrain, cv=10)
print(f"Cross Validation Scores for Naive Bayes \n{scores}\n")
scores = cross_val_score(rf, xTrain, yTrain, cv=10)
print(f"Cross Validation Scores for Random Forest\n{scores}\n")
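
When scores are collected over many random partitions, the maximum alone tends to be optimistic, so it may be worth reporting the mean and spread alongside it. The sketch below illustrates this on synthetic data from `make_classification`; the data, the variable names, and the choice of 10 seeds are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected features and the binary target
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

test_scores = []
for seed in range(10):
    # A different seed per iteration yields a different 80/20 partition
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    test_scores.append(model.score(X_te, y_te))

test_scores = np.array(test_scores)
print(f"mean={test_scores.mean():.3f} std={test_scores.std():.3f} max={test_scores.max():.3f}")
```

A small standard deviation across seeds suggests the ranking of models does not depend on one lucky split.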

### 6. Model Evaluation
* Plot ROC curve & report the AUC.
* Recommend which model is best with your reasoning.


Each classifier's predicted probabilities, plus an all-zero baseline (`r_probs`) that stands in for random chance:

In [None]:
r_probs = [0 for _ in range(len(yTest))]
lr_probs = lr.predict_proba(xTest)
rf_probs = rf.predict_proba(xTest)
nb_probs = nb.predict_proba(xTest)

In [None]:
lr_probs =lr_probs[:,1]
rf_probs =rf_probs[:,1]
nb_probs =nb_probs[:,1]

Find the AUC score for each classifier.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
r_auc = roc_auc_score(yTest, r_probs)
rf_auc = roc_auc_score(yTest, rf_probs)
nb_auc = roc_auc_score(yTest, nb_probs)
lr_auc = roc_auc_score(yTest, lr_probs)

print(f"Random (chance) Prediction: AUROC = {r_auc}")
print(f"Random Forest Prediction: AUROC = {rf_auc}")
print(f"Naive Bayes (Bernoulli) Prediction: AUROC =  {nb_auc}")
print(f"Logistic Regression Prediction: AUROC =  {lr_auc}")
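
To make the AUC values easier to interpret: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by brute force over all pairs. The helper `auc_by_pairs` and the toy labels are hypothetical and serve only as illustration; in practice `roc_auc_score` is the tool to use.

```python
import numpy as np


def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


labels = [0, 0, 1, 1]
auc_model = auc_by_pairs(labels, [0.1, 0.4, 0.35, 0.8])
auc_constant = auc_by_pairs(labels, [0, 0, 0, 0])
print(auc_model, auc_constant)
```

A constant score ties every pair, which is why an all-zero baseline lands at exactly 0.5.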

Calculate and plot the ROC Curves

In [None]:
r_fpr, r_tpr, _ = roc_curve(yTest, r_probs)
rf_fpr, rf_tpr, _ = roc_curve(yTest, rf_probs)
nb_fpr, nb_tpr, _ = roc_curve(yTest, nb_probs)
lr_fpr, lr_tpr, _ = roc_curve(yTest, lr_probs)

In [None]:

plt.figure(figsize=(20,10))

plt.plot(r_fpr, r_tpr, linestyle='--', label="Random Chance")
plt.plot(rf_fpr, rf_tpr, marker='.', label="Random Forest")
plt.plot(nb_fpr, nb_tpr, marker='.', label="Naive Bayes (Bernoulli)")
plt.plot(lr_fpr, lr_tpr, marker='.', label="Logistic Regression")
# Title
plt.title('ROC Plot')
# Axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# Show legend
plt.legend(loc ="lower right",prop={'size': 20})
# Show plot
plt.show()


From the analysis we have conducted, we conclude that the best features to use are {' kw_avg_avg', ' num_keywords', ' kw_avg_max', ' rate_positive_words', ' timedelta', ' abs_title_subjectivity', ' is_weekend', ' shares', ' min_negative_polarity', ' max_positive_polarity'}. After selecting the features, we tested the models in several ways to check that they were not overfitting the training data. The model that proved best by scikit-learn accuracy score, cross-validation score, AUC, and ROC curve is the Random Forest classifier.
