# <font color='green'>Predicting Stocks (Goes up or down) using News Headlines</font>

### * This kernel builds a model that predicts whether the stock index goes up or down based on the top 25 news headlines
 
### * The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".
 
### * In the Label column the value is "1" when the DJIA Adj Close value rose or stayed the same
 
### * In the Label column the value is "0" when the DJIA Adj Close value decreased.
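
As a rough illustration of how such a label could be derived, the sketch below compares each day's adjusted close with the previous day's. The function name `make_labels` and the sample prices are made up for illustration, and the comparison against the previous trading day is an assumption; the dataset ships with the `Label` column already computed.

```python
def make_labels(adj_close):
    """Return 1 when the close rose or stayed the same versus the previous day, else 0."""
    return [1 if today >= prev else 0
            for prev, today in zip(adj_close, adj_close[1:])]


# hypothetical adjusted close values, not taken from the dataset
prices = [100.0, 101.5, 101.5, 99.0]
labels = make_labels(prices)
print(labels)
```

The first day has no previous close, so the list of labels is one element shorter than the list of prices.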

## <font color='darkred'>Objective :</font>
### The goal is to create a machine learning model that predicts whether the DJIA goes up or down based on the top 25 headlines

## <font color='darkred'>Whole process in detail :</font>
### 1)  Filling null values in the headline columns

### 2)  Cleaning the text by removing punctuation and changing all the letters to lowercase

### 3)  Combining all the headlines of a day into one text

### 4)  Applying CountVectorizer to the combined headlines

### 5)  Visualizing the results and choosing the best algorithm based on requirements

In [None]:
import pandas as pd
import numpy as np 
import warnings
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import plotly.graph_objects as go
import plotly.express as px


warnings.filterwarnings('ignore')

# load the file that contains the date, the label (index went up or down) and the top 25 headlines
data1 = pd.read_csv('../input/stocknews/Combined_News_DJIA.csv')
data1.head()

In [None]:
data1.isnull().sum()

## <font color='darkred'>Data Cleaning</font>

In [None]:
# filling the missing headlines with an empty string
# (the headline columns hold text, so a median is not defined for them)

for col in ['Top23', 'Top24', 'Top25']:
    data1[col] = data1[col].fillna('')
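
Because the headline columns hold text, a numeric statistic such as the median is not available for them. A minimal sketch of one simple alternative, filling missing headlines with an empty string, is shown below on a toy frame (the column values are made up; only the column names mirror the dataset).

```python
import pandas as pd

# toy frame standing in for the headline columns (values are made up)
toy = pd.DataFrame({
    "Top23": ["markets rally", None],
    "Top24": [None, "oil prices fall"],
})

missing_before = int(toy.isnull().sum().sum())  # missing cells before filling
toy = toy.fillna("")                            # a missing headline becomes an empty string
missing_after = int(toy.isnull().sum().sum())   # missing cells after filling
print(missing_before, missing_after)
```

An empty string adds no tokens when the headlines are later joined and vectorized, so it does not introduce artificial words into the vocabulary.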

In [None]:
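# separating the data into train and test
# (the Date column is assumed to hold ISO strings such as '2015-01-02',
#  so the cutoffs are written in the same format)

train = data1[data1['Date'] < '2015-01-01'].copy()
test = data1[data1['Date'] > '2014-12-31'].copy()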
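
Comparing dates as plain strings only behaves chronologically when both sides share one format. The sketch below, on a made-up toy frame, shows the pitfall and a date-based split using `pd.to_datetime`; the names `toy`, `toy_train` and `toy_test` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 0],
})

# a dashed date compared with an undashed cutoff is ordered character by
# character, so a 2015 date can sort before the 2015 cutoff
mismatch = "2015-01-02" < "20150101"

# parsing the column removes the dependence on the string format
dates = pd.to_datetime(toy["Date"])
cutoff = pd.Timestamp("2015-01-01")
toy_train = toy[dates < cutoff].copy()
toy_test = toy[dates >= cutoff].copy()
print(mismatch, len(toy_train), len(toy_test))
```

Calling `.copy()` on each slice gives independent frames, which avoids chained-assignment warnings when the text columns are modified later.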

In [None]:
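# removing punctuation and changing all the letters to lowercase for both train and test
# (only the headline columns are touched, so 'Date' and 'Label' stay intact)

all_data = [train,test]

for df in all_data:
    headline_cols = [c for c in df.columns if c not in ('Date', 'Label')]
    for i in headline_cols:
        # replace every non-letter character with a space, then lowercase
        df[i] = df[i].astype(str).str.replace("[^a-zA-Z]", " ", regex=True).str.lower()

train.head()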
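
To see what this cleaning step does to a single headline, here is a small standalone sketch using the same regular expression with Python's `re` module. The helper name `clean_headline` and the sample text are made up for illustration.

```python
import re


def clean_headline(text):
    """Replace every non-letter character with a space and lowercase the result."""
    return re.sub(r"[^a-zA-Z]", " ", text).lower()


cleaned = clean_headline("U.S. stocks RALLY 3%!")
tokens = cleaned.split()
print(tokens)
```

Note that digits are removed along with punctuation, so numbers such as percentages or years do not survive this step.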

In [None]:
# combining all the headlines in train data into one and appending them into a list 

headlines = []
for row in range(0,len(train.index)):
    headlines.append(' '.join(str(x) for x in train.iloc[row,2:]))
headlines[0]

In [None]:
# combining all the headlines in test data into one and appending them into a list 

test_transform= []
for row in range(0,len(test.index)):
    test_transform.append(' '.join(str(x) for x in test.iloc[row,2:27]))
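
The same row-wise join can also be written without an explicit loop. The sketch below is one possible alternative on a made-up toy frame; it selects the headline columns by name (an assumption that they all start with `Top`) rather than by position.

```python
import pandas as pd

# toy frame with two headline columns (values are made up)
toy = pd.DataFrame({
    "Date": ["2015-01-02", "2015-01-05"],
    "Label": [1, 0],
    "Top1": ["stocks rally", "oil falls"],
    "Top2": ["jobs report beats forecast", "banks under pressure"],
})

headline_cols = [c for c in toy.columns if c.startswith("Top")]

# join the headlines of each row into one string
combined = toy[headline_cols].astype(str).apply(" ".join, axis=1).tolist()
print(combined[0])
```

Selecting columns by name keeps the join correct even if the column order of the frame changes.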

## <font color='darkred'>Applying Machine Learning Algorithms (Random Forest, XGBoost and CatBoost)</font>

In [None]:
# Applying countvectorizer on headlines list that we created before and max features is set to 100009

countvector=CountVectorizer(ngram_range=(2,2),max_features=100009)
traindataset=countvector.fit_transform(headlines)

randomclassifier=RandomForestClassifier(n_estimators=200,criterion='entropy')
randomclassifier.fit(traindataset,train['Label'])



<font color='darkblue'>The maximum number of features for CountVectorizer is set to 100009 because, among the values I tried, it gave the best accuracy with the fewest false positives (see the confusion matrix below). You can try other values yourself; if you find a better accuracy with a different max_features, please comment below.</font>
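
To make the roles of `ngram_range` and `max_features` concrete, the sketch below runs `CountVectorizer` on two made-up sentences. With `ngram_range=(2,2)` every feature is a pair of adjacent words, and `max_features` keeps only the most frequent of those pairs. The toy documents are an assumption for illustration, not data from the kernel.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stocks rally on jobs report", "stocks fall on jobs report"]

# bigrams only: each feature is a pair of adjacent words
vec = CountVectorizer(ngram_range=(2, 2))
X = vec.fit_transform(docs)
# feature names ordered by their column index
features = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(features)
print(X.toarray())

# max_features keeps only the most frequent bigrams across the corpus
vec_small = CountVectorizer(ngram_range=(2, 2), max_features=2)
vec_small.fit(docs)
top_features = sorted(vec_small.vocabulary_)
print(top_features)
```

Here only "jobs report" and "on jobs" occur in both sentences, so they are the two bigrams that survive `max_features=2`.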

### <font color='darkred'>Random forest without hyperparameter tuning</font>

In [None]:
# Applying countvectorizer on test_transform list that we created before 

test_dataset = countvector.transform(test_transform)
predictions = randomclassifier.predict(test_dataset)

In [None]:
# confusion matrix for the random forest predictions

matrix=confusion_matrix(test['Label'],predictions)
print(matrix)

In [None]:
# accuracy score (compares the original test labels with the predictions)

score=accuracy_score(test['Label'],predictions)
print(score)

<font color='darkblue'>Let's apply XGBoost and try different values of max_features and ngram_range for CountVectorizer to see which combination gives the highest accuracy.</font>




### <font color='darkred'>XGBoost without hyperparameter tuning</font>

In [None]:
max_features_num = [500,600,700,800,900,1000]
ngram = [1,2,3,4,5]
for i in max_features_num:
    for j in ngram:
        countvector=CountVectorizer(ngram_range=(j,j),max_features=i)
        traindataset=countvector.fit_transform(headlines)
        test_dataset = countvector.transform(test_transform)

        # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
        feature_names = countvector.get_feature_names_out()
        xgb = XGBClassifier(random_state =1)
        xgb.fit(pd.DataFrame(traindataset.todense(), columns=feature_names),train['Label'])
        predictions = xgb.predict(pd.DataFrame(test_dataset.todense(), columns=feature_names))
        score=accuracy_score(test['Label'],predictions)
        print('max number of features used : {}'.format(i))
        print('ngram_range ({},{})'.format(j,j))
        print(score)
        matrix=confusion_matrix(test['Label'],predictions)
        print('confusion matrix : {}'.format(matrix))
        print('===============================')

<font color='darkblue'>Maximum accuracy :</font>

max number of features used : 800

ngram_range (2,2)

0.8650793650793651

confusion matrix : [[161  25]
 [ 26 166]]
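
The accuracy can be recomputed from the matrix reported above. The sketch assumes scikit-learn's `confusion_matrix` convention, in which rows are the true labels and columns the predicted labels, both in the order 0, 1; under that convention the cells read as true negatives, false positives, false negatives and true positives.

```python
# confusion matrix reported above: rows = true label, columns = predicted label
matrix = [[161, 25],
          [26, 166]]

(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tn + tp) / total     # share of correct predictions
recall = tp / (tp + fn)          # share of actual "up" days that were caught
precision = tp / (tp + fp)       # share of predicted "up" days that were right
print(total, accuracy, recall, precision)
```

The total of 378 is the number of rows in the test set, and the accuracy matches the printed score.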

In [None]:
countvector=CountVectorizer(ngram_range=(1,1),max_features=800)
traindataset=countvector.fit_transform(headlines)
test_dataset = countvector.transform(test_transform)


# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = countvector.get_feature_names_out()
xgb = XGBClassifier(random_state =1)
xgb.fit(pd.DataFrame(traindataset.todense(), columns=feature_names),train['Label'])
predictions = xgb.predict(pd.DataFrame(test_dataset.todense(), columns=feature_names))

In [None]:
predictions

### <font color='darkred'>CATBoost without hyperparameter tuning</font>

In [None]:
cb=CatBoostClassifier(random_state=1)
feature_names = countvector.get_feature_names_out()
cb.fit(pd.DataFrame(traindataset.todense(), columns=feature_names),train['Label'])
# predict with the CatBoost model itself (not the earlier XGBoost model)
predictions = cb.predict(pd.DataFrame(test_dataset.todense(), columns=feature_names))
matrix=confusion_matrix(test['Label'],predictions)
score=accuracy_score(test['Label'],predictions)
print(score)
print('===============')
print(matrix)

<font color='darkblue'>CatBoost gives results close to XGBoost's, with a slightly lower accuracy.

Now let's tune the hyperparameters and see whether the models improve.
    
First we will perform hyperparameter tuning for random forest.</font>

###  <font color='darkred'>Random forest with hyperparameter tuning </font>

In [None]:
def performance(classifier, model_name):
    print(model_name)
    print('Best Score: ' + str(classifier.best_score_))
    print('Best Parameters: ' + str(classifier.best_params_))


rf = RandomForestClassifier(random_state = 1)
param_grid =  {'n_estimators': [100,300,400],
               'criterion':['gini','entropy'],
                                  'bootstrap': [True,False],
                                  'max_depth': [None,15, 20],
                                  'max_features': ['sqrt', 10],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
                                  'min_samples_leaf': [1,2,5],
                                  'min_samples_split': [2,3,5]}

clf_rf = GridSearchCV(rf,param_grid = param_grid, cv=5 , verbose = True, n_jobs = -1)
best_clf_rf = clf_rf.fit(traindataset,train['Label'])
performance(best_clf_rf,'Random Forest')
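
It helps to know how much work an exhaustive grid search implies before starting it. The sketch below counts the candidate combinations for a grid of the same size as the one above and multiplies by the number of cross-validation folds; the variable names are illustrative.

```python
from math import prod

# a grid with the same number of options per parameter as the one above
param_grid = {
    'n_estimators': [100, 300, 400],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'max_depth': [None, 15, 20],
    'max_features': ['sqrt', 10],
    'min_samples_leaf': [1, 2, 5],
    'min_samples_split': [2, 3, 5],
}
cv_folds = 5

# GridSearchCV tries every combination, once per fold
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With 648 candidates and 5 folds the search trains 3240 forests, which is why `n_jobs = -1` is used to run the fits in parallel.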

In [None]:
best_rf = best_clf_rf.best_estimator_

In [None]:
countvector=CountVectorizer(ngram_range=(2,2))
traindataset=countvector.fit_transform(headlines)
test_dataset = countvector.transform(test_transform)

best_rf.fit(traindataset,train['Label'])
predictions = best_rf.predict(test_dataset)
predictions

In [None]:
score=accuracy_score(test['Label'],predictions)
print(score)
print('========================')
print('confusion matrix :')
matrix=confusion_matrix(test['Label'],predictions)
print(matrix)

<font color='darkblue'>As you can see, hyperparameter tuning on random forest gave good predictions and also reduced the number of false negatives.
</font>


<font color='darkblue'>Let's use hyperparameter tuning for XGBoost and see whether the accuracy improves.</font>

###  <font color='darkred'>XGBoost with hyperparameter tuning </font>

In [None]:
countvector=CountVectorizer(ngram_range=(1,1),max_features=800)
traindataset=countvector.fit_transform(headlines)
test_dataset = countvector.transform(test_transform)

xgb = XGBClassifier(random_state =1)
param_grid = {
    'n_estimators': [500,550,600,650],
    'colsample_bytree': [0.75,0.8,0.85],
    'max_depth': [None],
    'reg_alpha': [1],
    'reg_lambda': [2, 5, 10],
    'subsample': [0.55, 0.6, .65,0.9],
    'learning_rate':[0.5],
    'gamma':[.5,1,2],
    'min_child_weight':[0.01],
    'sampling_method': ['uniform']
}

clf_xgb = RandomizedSearchCV(xgb, param_distributions = param_grid, cv = 5, verbose = True, n_jobs = -1)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
best_clf_xgb = clf_xgb.fit(pd.DataFrame(traindataset.todense(), columns=countvector.get_feature_names_out()),train['Label'])
performance(best_clf_xgb,'XGB')

In [None]:
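best_xgb = best_clf_xgb.best_estimator_

# RandomizedSearchCV refits the best estimator on the full training data by default,
# so the tuned XGBoost model (not the random forest) can predict directly
predictions = best_xgb.predict(pd.DataFrame(test_dataset.todense(), columns=countvector.get_feature_names_out()))
predictions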

In [None]:
score=accuracy_score(test['Label'],predictions)
print('score :')
print(score)
print('==================================')
print('confusion matrix :')
matrix=confusion_matrix(test['Label'],predictions)
print(matrix)

<font color='darkblue'>As you can see above, tuning the hyperparameters for XGBoost did not improve the accuracy.</font>

## <font color='darkred'> Conclusion</font>



<font color='darkblue'>After all this analysis we can conclude that the algorithm with good accuracy and the fewest false negatives is random forest with hyperparameter tuning.

If you care more about true positives and less about false negatives, then the best algorithm for you is XGBoost without hyperparameter tuning.</font>

In [None]:
fin_score = {'randomforest (without hp)':0.859788 , 'randomforest (with hp)':0.851851,
             'XGBoost (without hp)':0.8650793,'XGBoost (with hp)':0.806878,'CATBoost(without hpt)':0.83597}
import plotly
plotly.offline.init_notebook_mode (connected = True)

In [None]:
px.bar(x = list(fin_score.keys()),y = list(fin_score.values()),title='ACCURACY SCORE FOR RF, XGB AND CATBOOST (WITH AND WITHOUT HYPERPARAMETER TUNING)',labels={'x':'Algorithms','y':'Score'})

In [None]:
x1 = ['randomforest(without hpt)','randomforest(with hpt)','XGBoost(without hpt)','XGBoost(with hpt)','CATBoost(without hpt)']
x1_TP = [135,131,161,139,154]
X1_FN = [5,1,26,26,30]

In [None]:
fig = go.Figure(data=[
    go.Bar(name='TRUE POSITIVE', x=x1, y=x1_TP),
    go.Bar(name='FALSE NEGATIVE', x=x1, y=X1_FN)
])

fig.update_layout(barmode='group')
fig.show()

#### <font color='darkblue'>For a clear picture of how each model performed and how many true positives and false negatives it produced, see the visualizations above and pick the model that fits your requirements.</font>

#### <font color='darkblue'>If you like my work, please upvote my kernel, and if you have any suggestions, please comment below.</font>

#### <font color='darkblue'>Thank you :)</font>