<a href="https://colab.research.google.com/github/sahoomrutyunjaya12345/NewsPopularityPrediction/blob/main/NewsPopularityPredictionCapstoneProjecipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Title: Predicting News Popularity on Multiple Social Media Platforms**

# Problem Description
## This is a large data set of news items and their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The collected data relates to a period of 8 months, between November 2015 and July 2016, accounting for about 100,000 news items on four different topics: Economy, Microsoft, Obama and Palestine.

## Attribute Information:
* IDLink (numeric): Unique identifier of news items
* Title (string): Title of the news item according to the official media sources
* Headline (string): Headline of the news item according to the official media sources
* Source (string): Original news outlet that published the news item
* Topic (string): Query topic used to obtain the items in the official media sources
* PublishDate (timestamp): Date and time of the news items' publication
* SentimentTitle (numeric): Sentiment score of the text in the news items' title
* SentimentHeadline (numeric): Sentiment score of the text in the news items' headline
* Facebook (numeric): Final value of the news items' popularity according to the social media source Facebook
* GooglePlus (numeric): Final value of the news items' popularity according to the social media source Google+
* LinkedIn (numeric): Final value of the news items' popularity according to the social media source LinkedIn




# Summary

With advances in technology, news organizations have begun to rely more on online social platforms and media analytics to attract readers. For news publishers, it has therefore become important to know which kinds of news articles will appeal most to readers. In this project, we work with a news dataset of around 100,000 news items on four topics (Obama, Economy, Palestine and Microsoft), collected between November 2015 and July 2016, together with their popularity on three social media platforms: Facebook, Google Plus and LinkedIn. We also have 12 social feedback datasets, one per platform and topic, which contain the popularity level of each news item in incremental time slices of 20 minutes after publication.

As the first step of our experiment, we performed data cleaning: removing trash and duplicate data, applying null value treatment, and removing outliers using the 90th-percentile quantile method. We then applied standardization for feature scaling.
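The outlier-removal and scaling steps described above can be sketched on a toy column. This is only an illustration: it assumes that the "90th-percentile quantile method" means dropping rows whose value exceeds the 90th percentile, which may differ from the rule actually used later in the notebook. The column name `popularity` and the values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with one extreme outlier.
toy = pd.DataFrame({"popularity": [0, 1, 2, 3, 4, 5, 6, 7, 8, 1000]})

# Assumption: keep only rows at or below the 90th percentile.
cutoff = toy["popularity"].quantile(0.90)
trimmed = toy[toy["popularity"] <= cutoff].copy()

# Standardization (z-score): subtract the mean, divide by the standard deviation.
# ddof=0 is the population standard deviation, which is what
# sklearn's StandardScaler uses.
mean = trimmed["popularity"].mean()
std = trimmed["popularity"].std(ddof=0)
trimmed["popularity_scaled"] = (trimmed["popularity"] - mean) / std
```

After this, `trimmed` no longer contains the extreme value, and `popularity_scaled` has zero mean and unit variance.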

In Exploratory Data Analysis, we categorized SentimentTitle, SentimentHeadline and sources to extract some meaningful insights from the data. Then we compared popularity between the social media platforms using multiple plots.

We applied text-preprocessing techniques to transform the headline and title of the news items.
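As a rough idea of what such text preprocessing can look like, here is a minimal sketch using only the standard library. The stopword list is a tiny hypothetical stand-in for NLTK's English stopword corpus, and stemming or lemmatization is left out; the function name `preprocess` is also an assumption, not the notebook's.

```python
import string

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "at", "in", "of", "to", "and", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOPWORDS]

tokens = preprocess("Obama Lays Wreath at Arlington National Cemetery")
print(tokens)
```

The resulting token list can then be joined back into a string and passed to a vectorizer such as `TfidfVectorizer`.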

For feature selection, we used ExtraTreesRegressor and a correlation matrix to assess the features.

For prediction, we used supervised machine learning algorithms (Decision Trees, CatBoost, LightGBM, Gradient Boosting and KNN) and then applied hyperparameter tuning to improve predictive performance and reduce overfitting.

In [None]:
#  textblob library to work with textual data
!pip install -U textblob
!pip install catboost

!pip uninstall scikit-learn -y
!pip install -U scikit-learn
!pip install shap

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting catboost
  Downloading catboost-1.1.1-cp310-none-manylinux1_x86_64.whl (76.6 MB)
Installing collected packages: catboost
Successfully installed catboost-1.1.1
Found existing installation: scikit-learn 1.2.2
Uninstalling scikit-learn-1.2.2:
  Successfully uninstalled scikit-learn-1.2.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn
  Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Installing collected packages: scikit-lea

In [None]:
# Importing all libraries
import numpy as np
import pandas as pd

# Visualisation Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Sklearn Libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor


# Model Libraries
import catboost as cb
from sklearn.tree import DecisionTreeRegressor 
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from lightgbm import LGBMRegressor
from sklearn.ensemble import GradientBoostingRegressor


# Miscellaneous Libraries
from datetime import datetime
import time
import calendar
import random

import nltk
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import Word
import shap
import IPython

import warnings
warnings.filterwarnings('ignore')


nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Loading the dataset
sources= ['Facebook', 'GooglePlus', 'LinkedIn']
topics = ['Economy','Microsoft', 'Obama', 'Palestine']

folder_path = '/content/drive/MyDrive/AlmaBetter/NEWS POPULARITY PREDICTION CAPSTONE/Data & Resources/'
master_df = pd.read_csv(folder_path+'News_Final.csv')

# Load the 12 social feedback files into a dict keyed by '<source>_<topic>'
df = {}
for source in sources:
  for topic in topics:
    key = f'{source}_{topic}'
    df[key] = pd.read_csv(f'{folder_path}{key}.csv')

## Datasets:
*   Social Media Feedback Datasets (12 Datasets)
*   News Dataset




## 1. Social Media Feedback Datasets (12 Datasets)

In [None]:
# viewing the dataset
df['Facebook_Microsoft'].head()

Unnamed: 0,IDLink,TS1,TS2,TS3,TS4,TS5,TS6,TS7,TS8,TS9,...,TS135,TS136,TS137,TS138,TS139,TS140,TS141,TS142,TS143,TS144
0,101,-1,-1,-1,-1,-1,30,30,30,30,...,131,131,131,131,131,131,131,131,133,133
1,102,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,57,57,57,57,57,57,57,58,58,58
2,103,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,259,259,260,260,260,260,261,262,262,263
3,104,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,13,13,13,13,13,13,13,13,13,13
4,105,-1,-1,-1,-1,-1,-1,-1,-1,-1,...,314,314,315,315,316,316,316,316,316,316


In [None]:
df['Facebook_Microsoft'].shape

(18531, 145)

In [None]:
# Checking for null values across all 12 datasets
for d in df:
    print(f'Null Values of {d}:', df[d].isna().sum().sum())

Null Values of Facebook_Economy: 0
Null Values of Facebook_Microsoft: 0
Null Values of Facebook_Obama: 0
Null Values of Facebook_Palestine: 0
Null Values of GooglePlus_Economy: 0
Null Values of GooglePlus_Microsoft: 0
Null Values of GooglePlus_Obama: 0
Null Values of GooglePlus_Palestine: 0
Null Values of LinkedIn_Economy: 0
Null Values of LinkedIn_Microsoft: 0
Null Values of LinkedIn_Obama: 0
Null Values of LinkedIn_Palestine: 0


In [None]:
# Checking the size of all datasets
for d in df:
    print(f'Shape of {d}: {df[d].shape}')


Shape of Facebook_Economy: (29928, 145)
Shape of Facebook_Microsoft: (18531, 145)
Shape of Facebook_Obama: (27015, 145)
Shape of Facebook_Palestine: (7687, 145)
Shape of GooglePlus_Economy: (33069, 145)
Shape of GooglePlus_Microsoft: (20702, 145)
Shape of GooglePlus_Obama: (27157, 145)
Shape of GooglePlus_Palestine: (7749, 145)
Shape of LinkedIn_Economy: (33069, 145)
Shape of LinkedIn_Microsoft: (20702, 145)
Shape of LinkedIn_Obama: (27157, 145)
Shape of LinkedIn_Palestine: (7749, 145)


# Observations:
* All 12 datasets share the structure of the one shown above: an IDLink column followed by 144 time-slice columns (TS1 to TS144).
* None of them contains null values.
* TS144 is the last time slice, 144 × 20 minutes = 48 hours after publication. It corresponds to the final popularity value, the dependent variable in the news dataset.
* A popularity level of -1 means the news item has not yet appeared on the platform.
* A popularity level of 0 means the news item has appeared on the platform but has received no feedback so far.
* A popularity level of 1 means the item's popularity has risen to 1, and so on.
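To make the -1 encoding concrete, the sketch below finds the first time slice in which each item is visible on a platform and converts it to elapsed minutes. The toy frame, its values and the derived column names `first_slice` and `minutes_to_appear` are hypothetical; only the 20-minute slice length comes from the dataset description.

```python
import pandas as pd

SLICE_MINUTES = 20  # each TS column covers one 20-minute slice

# Toy frame in the raw encoding (hypothetical IDs and values).
toy = pd.DataFrame(
    {"IDLink": [1, 2, 3],
     "TS1": [-1, 0, -1],
     "TS2": [-1, 0, -1],
     "TS3": [4, 2, -1]}
)
ts_cols = [c for c in toy.columns if c.startswith("TS")]

# An item is visible in a slice once its value is no longer -1.
appeared = toy[ts_cols].ne(-1)

# 1-based position of the first visible slice; items that never
# appear are masked to NaN.
first_slice = appeared.to_numpy().argmax(axis=1) + 1
toy["first_slice"] = pd.Series(first_slice, index=toy.index).where(appeared.any(axis=1))
toy["minutes_to_appear"] = toy["first_slice"] * SLICE_MINUTES
```

Here item 1 first appears in TS3 (about 60 minutes after publication), item 2 in TS1, and item 3 never appears within the observed slices.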

## 2. News Dataset

In [None]:
news_df = pd.read_csv(folder_path+'News_Final.csv')
master_df = news_df.copy()
master_df.head()

Unnamed: 0,IDLink,Title,Headline,Source,Topic,PublishDate,SentimentTitle,SentimentHeadline,Facebook,GooglePlus,LinkedIn
0,99248.0,Obama Lays Wreath at Arlington National Cemetery,Obama Lays Wreath at Arlington National Cemete...,USA TODAY,obama,2002-04-02 00:00:00,0.0,-0.0533,-1,-1,-1
1,10423.0,A Look at the Health of the Chinese Economy,"Tim Haywood, investment director business-unit...",Bloomberg,economy,2008-09-20 00:00:00,0.208333,-0.156386,-1,-1,-1
2,18828.0,Nouriel Roubini: Global Economy Not Back to 2008,"Nouriel Roubini, NYU professor and chairman at...",Bloomberg,economy,2012-01-28 00:00:00,-0.42521,0.139754,-1,-1,-1
3,27788.0,Finland GDP Expands In Q4,Finland's economy expanded marginally in the t...,RTT News,economy,2015-03-01 00:06:00,0.0,0.026064,-1,-1,-1
4,27789.0,"Tourism, govt spending buoys Thai economy in J...",Tourism and public spending continued to boost...,The Nation - Thailand&#39;s English news,economy,2015-03-01 00:11:00,0.0,0.141084,-1,-1,-1


In [None]:
# Checking the size of final news dataset
master_df.shape

(93239, 11)

In [None]:
master_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93239 entries, 0 to 93238
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   IDLink             93239 non-null  float64
 1   Title              93239 non-null  object 
 2   Headline           93224 non-null  object 
 3   Source             92960 non-null  object 
 4   Topic              93239 non-null  object 
 5   PublishDate        93239 non-null  object 
 6   SentimentTitle     93239 non-null  float64
 7   SentimentHeadline  93239 non-null  float64
 8   Facebook           93239 non-null  int64  
 9   GooglePlus         93239 non-null  int64  
 10  LinkedIn           93239 non-null  int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 7.8+ MB
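Note that PublishDate is loaded as `object` (plain strings) even though it is described as a timestamp. A minimal sketch of converting it, assuming the `YYYY-MM-DD HH:MM:SS` format seen in the preview above; the derived column names are hypothetical:

```python
import pandas as pd

# Toy rows mimicking the PublishDate strings shown in the preview.
toy = pd.DataFrame({"PublishDate": ["2015-03-01 00:06:00", "2015-03-01 00:11:00"]})

# Convert the string column to datetime64 so date parts can be extracted.
toy["PublishDate"] = pd.to_datetime(toy["PublishDate"], format="%Y-%m-%d %H:%M:%S")

# Example derived features (hypothetical column names).
toy["publish_hour"] = toy["PublishDate"].dt.hour
toy["publish_weekday"] = toy["PublishDate"].dt.day_name()
```

The same conversion on `master_df['PublishDate']` would allow grouping popularity by hour or weekday during EDA.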


In [None]:
master_df.describe()

Unnamed: 0,IDLink,SentimentTitle,SentimentHeadline,Facebook,GooglePlus,LinkedIn
count,93239.0,93239.0,93239.0,93239.0,93239.0,93239.0
mean,51560.653257,-0.005411,-0.027493,113.141336,3.888362,16.547957
std,30391.078704,0.136431,0.141964,620.173233,18.492648,154.459048
min,1.0,-0.950694,-0.755433,-1.0,-1.0,-1.0
25%,24301.5,-0.079057,-0.114574,0.0,0.0,0.0
50%,52275.0,0.0,-0.026064,5.0,0.0,0.0
75%,76585.5,0.064255,0.059709,33.0,2.0,4.0
max,104802.0,0.962354,0.964646,49211.0,1267.0,20341.0


# Observations:
* The Title and Headline columns contain textual data. We will need TfidfVectorizer or CountVectorizer to turn them into numeric features.
* Topic is a categorical column.
* The Source and Headline columns contain some null values.
* A popularity level of -1 shows that the news item did not appear on the platform within two days of publication.
* Some rows have a popularity of -1 on all three social media platforms (the dependent features). We treat these rows as trash data.
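The last observation can be checked with a boolean mask. The sketch below uses a toy frame with hypothetical values; on the real data the same mask would be built from `master_df` before the popularity values are shifted.

```python
import pandas as pd

platforms = ["Facebook", "GooglePlus", "LinkedIn"]

# Toy rows in the raw encoding (hypothetical values).
toy = pd.DataFrame(
    {"IDLink": [1, 2, 3],
     "Facebook": [-1, 5, -1],
     "GooglePlus": [-1, -1, -1],
     "LinkedIn": [-1, 0, 2]}
)

# True where the item never appeared on any of the three platforms.
never_appeared = toy[platforms].eq(-1).all(axis=1)

# Keep only items that appeared on at least one platform.
cleaned = toy[~never_appeared]
```

Only the first toy row is flagged, because the other two have a non-negative value on at least one platform.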

## Data Cleaning and Refactoring
Let's reformat and clean the data for smooth processing!

**Dealing with negative popularities**

* Both the Social Media Feedback datasets and the news dataset contain negative popularity values (-1).
* Negative values are awkward to handle during scaling, prediction and EDA, so we increase every popularity level by 1.
* After the shift, a popularity level of 0 means the news item has not yet appeared on the platform, 1 means it has appeared but has no feedback yet, and so on.
* Because the same constant is added to every value, the ordering of and differences between popularity levels are unchanged, so this step should not affect our prediction or analysis. It only makes the data easier to handle.

In [None]:
# Increase every popularity level by 1 so that the "not yet on the platform" marker -1 becomes 0
master_df[sources] = master_df[sources] + 1

for idf in df:
  # Shift every time-slice column, leaving the IDLink identifier untouched
  ts_cols = df[idf].columns.drop('IDLink')
  df[idf][ts_cols] = df[idf][ts_cols] + 1

In [None]:
df['Facebook_Economy'].head()

Unnamed: 0,IDLink,TS1,TS2,TS3,TS4,TS5,TS6,TS7,TS8,TS9,...,TS135,TS136,TS137,TS138,TS139,TS140,TS141,TS142,TS143,TS144
0,1,0,0,0,0,0,0,0,0,8,...,14,14,14,14,14,14,14,14,14,14
1,2,0,0,0,0,0,0,0,0,4,...,43,43,43,43,43,43,43,43,43,43
2,3,0,0,0,0,0,0,0,0,0,...,99,99,99,99,99,99,99,99,99,99
3,4,0,0,0,0,0,0,0,0,0,...,8,8,8,8,8,8,8,8,8,8
4,5,0,0,0,0,0,0,0,0,0,...,35,35,35,35,35,35,35,35,35,35


In [None]:
# checking the null values
master_df.isnull().sum()

IDLink                 0
Title                  0
Headline              15
Source               279
Topic                  0
PublishDate            0
SentimentTitle         0
SentimentHeadline      0
Facebook               0
GooglePlus             0
LinkedIn               0
dtype: int64
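Only Headline (15) and Source (279) have missing values, so at most 294 of the 93,239 rows (about 0.3%) are affected. Two common treatments are sketched below on a toy frame; which one the notebook applies is not shown in this section, so both are illustrative assumptions, as is the `"Unknown"` placeholder.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical) with the same two nullable columns.
toy = pd.DataFrame(
    {"Headline": ["Finland's economy expanded", np.nan, "Tourism boosts economy"],
     "Source": ["RTT News", "Bloomberg", np.nan]}
)

# Option 1: drop any row with a missing Headline or Source.
dropped = toy.dropna(subset=["Headline", "Source"])

# Option 2: keep every row and fill with explicit placeholders.
filled = toy.fillna({"Headline": "", "Source": "Unknown"})
```

Dropping is simplest when so few rows are affected, while filling keeps every popularity value available for modelling.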