Here, I have predicted the sentiment score of the Title and the Headline of the news articles. 

The target columns are:
- `SentimentTitle`, which is the sentiment score of the Title
- `SentimentHeadline`, which is the sentiment score of the Headline

I have used custom transformer pipelines with a Multi-Output Regressor in `scikit-learn`.

In [None]:
import re
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

import nltk
# nltk.download('stopwords')
# print('Downloaded Stopwords')
from nltk.corpus import stopwords
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

import xgboost as xgb
from xgboost import XGBRegressor
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import (CountVectorizer, HashingVectorizer,
                                             TfidfTransformer, TfidfVectorizer)
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputRegressor
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stop words and punctuation used by the custom tokenizer
stop_words = STOP_WORDS
punctuations = string.punctuation

In [None]:
train = pd.read_csv('../input/news-popularity-in-multiple-social-media-platforms/train_file.csv')

In [None]:
train.head()

In [None]:
train.loc[0,'Headline']

In [None]:
train.loc[0,'Title']

In [None]:
missing_val = pd.DataFrame(train.isnull().sum())
missing_val = missing_val.reset_index()
missing_val

In [None]:
train[train['Source'].isna()]

In [None]:
train.dropna(inplace=True)

In [None]:
train.info()

In [None]:
train.describe().T

In [None]:
train['Topic'].value_counts()

### EDA & Data Visualization

In [None]:
import nltk
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(['Palestinian','Palestine','Microsoft','Economy','Obama','Barack'])

In [None]:
from wordcloud import WordCloud
plt.figure(figsize=(12,6))
text = ' '.join(train.Title[train['Topic']=='economy'])
wc = WordCloud(background_color='white',stopwords=stopwords).generate(text)
plt.imshow(wc)

In [None]:
from wordcloud import WordCloud
plt.figure(figsize=(12,6))
text = ' '.join(train.Title[train['Topic']=='obama'])
wc = WordCloud(background_color='white',stopwords=stopwords).generate(text)
plt.imshow(wc)

In [None]:
from wordcloud import WordCloud
plt.figure(figsize=(12,6))
text = ' '.join(train.Title[train['Topic']=='microsoft'])
wc = WordCloud(background_color='white',stopwords=stopwords).generate(text)
plt.imshow(wc)

In [None]:
from wordcloud import WordCloud
plt.figure(figsize=(12,6))
text = ' '.join(train.Title[train['Topic']=='palestine'])
wc = WordCloud(background_color='white',stopwords=stopwords).generate(text)
plt.imshow(wc)

In [None]:
sns.set(style='darkgrid',palette='Set1')

In [None]:
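g = sns.jointplot(x='SentimentTitle', y='SentimentHeadline', data=train, kind='reg')
# JointGrid.annotate was removed in seaborn 0.11, so the Pearson correlation is added by hand
r, p = stats.pearsonr(train['SentimentTitle'], train['SentimentHeadline'])
g.ax_joint.annotate('pearsonr = {:.2f}; p = {:.2g}'.format(r, p),
                    xy=(0.05, 0.95), xycoords='axes fraction')
plt.show()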

In [None]:

# Bar graph exploring total sentiment for the different topics.
# The two sentiment columns are selected before summing, so non-numeric columns are never aggregated.
train.groupby('Topic')[['SentimentHeadline', 'SentimentTitle']].sum().plot(
    kind='bar', figsize=(25, 7), stacked=True, color=['b', 'r']);

In [None]:
plt.figure(figsize=(15,15))
_ = sns.heatmap(train[['Facebook','GooglePlus','LinkedIn','SentimentTitle','SentimentHeadline']].corr(), square=True, cmap='Blues',linewidths=0.5,linecolor='w',annot=True)
plt.title('Correlation matrix ')

plt.show()

### Loading the spaCy English pipeline

In [None]:
nlp = English()

### Custom Tokenizer Function

In [None]:
def spacy_tokenizer(sentence):
    # Tokenize the sentence with the spaCy English pipeline
    mytokens = nlp(sentence)

    # Lowercase each token's lemma. Pronouns (lemma "-PRON-" in spaCy 2.x) and tokens
    # without a lemma (possible when the pipeline has no lemmatizer) fall back to the lowercased text.
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ not in ("", "-PRON-") else word.lower_ for word in mytokens ]

    # Remove empty strings, spaCy stop words and punctuation from the string library
    mytokens = [ word for word in mytokens if word and word not in stop_words and word not in punctuations ]

    # Return the preprocessed list of tokens
    return mytokens

### Custom Transformer and text cleaner 

In [None]:
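class predictors(TransformerMixin):
    """Pipeline step that cleans each raw text before vectorization."""

    def transform(self, X, **transform_params):
        # Apply the basic cleaning to every document
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn; this step is stateless
        return self

    def get_params(self, deep=True):
        # No hyperparameters to expose
        return {}


def clean_text(text):
    # Strip surrounding whitespace and convert to lowercase
    return text.strip().lower()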

Converting text to bag-of-words count vectors (unigrams and bigrams, limited to the 100 most frequent terms)

In [None]:
bow_vector = CountVectorizer(max_features = 100,tokenizer = spacy_tokenizer,ngram_range=(1,2))
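
To make the effect of `ngram_range=(1,2)` concrete, here is a minimal sketch on two made-up titles. It uses scikit-learn's default tokenizer rather than the spaCy tokenizer from this notebook, and the example sentences are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up titles, used only for illustration
docs = ["obama visits microsoft", "microsoft economy grows"]

# Unigrams and bigrams with the default tokenizer (not the spaCy one)
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

# Each column is one unigram or bigram; each row is one document
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
```

With `max_features=100`, as in the cell above, only the 100 most frequent terms across the corpus would be kept as columns.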

Separating Title and Headline, so that they can be trained separately

In [None]:
X_train_title = train.loc[:,'Title'].values
y_train_title = train.loc[:,['SentimentTitle']].values

X_train_headline = train.loc[:,'Headline'].values
y_train_headline = train.loc[:,['SentimentHeadline']].values

In [None]:
X_train_title.shape

In [None]:
y_train_headline.shape

#### Splitting both Title and Headline into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
x_train_title, x_valid_title, Y_train_title, y_valid_title = train_test_split(X_train_title, y_train_title, shuffle = True, test_size = 0.15)
x_train_headline, x_valid_headline, Y_train_headline, y_valid_headline = train_test_split(X_train_headline, y_train_headline, shuffle = True, test_size = 0.15)
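
The splits above are shuffled without a fixed seed, so the validation rows change from run to run. As a small sketch on synthetic arrays, passing `random_state` (the value 42 is an arbitrary assumption) makes a 15% split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the text and target arrays
X = np.arange(100)
y = np.arange(100) * 0.01

# The same seed yields the same partition on every call
split_a = train_test_split(X, y, test_size=0.15, shuffle=True, random_state=42)
split_b = train_test_split(X, y, test_size=0.15, shuffle=True, random_state=42)

x_tr, x_va, y_tr, y_va = split_a
print(len(x_tr), len(x_va))
print((split_a[1] == split_b[1]).all())
```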

### XGBoost and Random Forest Regressor

In [None]:
xgboost = MultiOutputRegressor(XGBRegressor())
rand_for = MultiOutputRegressor(RandomForestRegressor(n_estimators=100,
                                                          max_depth=None,
                                                          random_state=0))
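
As a rough illustration of what `MultiOutputRegressor` does, the sketch below fits it on synthetic data with two target columns. The data and the small `n_estimators` value are assumptions chosen only to keep the example fast; the wrapper trains one copy of the base regressor per target column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic features and two synthetic target columns
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=0))
model.fit(X, Y)

# One fitted regressor per target column
print(len(model.estimators_))
# Predictions have one column per target
print(model.predict(X[:3]).shape)
```

In this notebook each target array has a single column, so each wrapped model holds exactly one underlying regressor.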

Defining separate pipelines for title and headline. You can choose which regressor you want to use. In this notebook I have used the Random Forest Regressor.

In [None]:
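from sklearn.base import clone

# Each pipeline gets its own copies of the vectorizer and the regressor.
# A Pipeline fits its steps in place, so sharing the same instances would let
# the second fit overwrite what the first pipeline learned.
pipe_title = Pipeline([("cleaner", predictors()),
                       ('vectorizer', clone(bow_vector)),
                       ('tfidf', TfidfTransformer()),
                       ('regressor', clone(rand_for))])

pipe_headline = Pipeline([("cleaner", predictors()),
                          ('vectorizer', clone(bow_vector)),
                          ('tfidf', TfidfTransformer()),
                          ('regressor', clone(rand_for))])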
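
A `Pipeline` fits the estimator objects it is given in place, so two pipelines built from the same instance end up sharing its fitted state. The sketch below shows this with a single vectorizer step and made-up one-line corpora, and contrasts it with `sklearn.base.clone`, which returns an unfitted copy with the same parameters.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

shared = CountVectorizer()

# Two pipelines that reuse the very same vectorizer object
pipe_a = Pipeline([("vectorizer", shared)])
pipe_b = Pipeline([("vectorizer", shared)])
pipe_a.fit(["stocks rally"])
pipe_b.fit(["peace talks resume"])

# pipe_a now holds the vocabulary learned while fitting pipe_b
shared_vocab_a = sorted(pipe_a.named_steps["vectorizer"].vocabulary_)
print(shared_vocab_a)

# Cloning gives each pipeline its own independent copy
pipe_c = Pipeline([("vectorizer", clone(shared))])
pipe_d = Pipeline([("vectorizer", clone(shared))])
pipe_c.fit(["stocks rally"])
pipe_d.fit(["peace talks resume"])

cloned_vocab_c = sorted(pipe_c.named_steps["vectorizer"].vocabulary_)
print(cloned_vocab_c)
```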

Train the Regressors for title and headline respectively

In [None]:
pipe_title.fit(x_train_title,Y_train_title)

In [None]:
pipe_headline.fit(x_train_headline,Y_train_headline)

Now we shall predict on the validation sets and then see what score we obtain

In [None]:
test_pred_title=pipe_title.predict(x_valid_title)

In [None]:
test_pred_headline=pipe_headline.predict(x_valid_headline)

Calculating the Mean Absolute errors for both Title and Headline sentiments

In [None]:
from sklearn.metrics import mean_absolute_error
mae_title=mean_absolute_error(y_valid_title,test_pred_title)
mae_headline=mean_absolute_error(y_valid_headline,test_pred_headline)
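
For reference, the mean absolute error is the average of the absolute differences between targets and predictions. A tiny hand-computed sketch, using made-up sentiment values:

```python
# Made-up targets and predictions, for illustration only
y_true = [0.10, -0.20, 0.00, 0.30]
y_pred = [0.05, -0.10, 0.10, 0.10]

# Absolute differences are 0.05, 0.10, 0.10 and 0.20
abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]

# Their mean is 0.45 / 4 = 0.1125
mae = sum(abs_errors) / len(abs_errors)
print(round(mae, 4))
```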

Here we calculate our final score. The score is calculated as

max(0, 1 - ((0.4 * (mean abs error of title)) + (0.6 * (mean abs error of headline))))

In [None]:
score=max(0, 1-((0.4*mae_title)+(0.6*mae_headline)))

In [None]:
print("Score = {} \nScore(out of 100%) = {}%".format(score,score*100))
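
The scoring rule can be wrapped in a small helper to see how the weights and the clipping at zero behave. The function name `weighted_score` and the MAE values below are hypothetical and are not results from this notebook.

```python
def weighted_score(mae_title, mae_headline):
    """Return max(0, 1 - (0.4 * MAE_title + 0.6 * MAE_headline))."""
    return max(0.0, 1.0 - (0.4 * mae_title + 0.6 * mae_headline))

# Hypothetical errors: 1 - (0.04 + 0.09) = 0.87
example = weighted_score(0.10, 0.15)

# Very large errors would give a negative value, which is clipped to 0
clipped = weighted_score(2.0, 2.0)

print(round(example, 2), clipped)
```

The headline error carries the larger weight (0.6), so improving the headline model moves the score more than an equal improvement on titles.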

We achieved a score of 89.9, which is pretty good. This score indicates how close our predicted values were to the target values. It cannot exactly be termed `accuracy`, because this is not a classification problem: our sentiment score is a real number between -1 and 1.