## The global warming issue and narratives around it
### Part 6: Testing the trained model on a new dataset

In this notebook, I tested the trained NLP model on a new dataset drawn from the more general [climate change](https://www.reddit.com/r/climatechange/) subreddit. The goal was to separate the posts about global warming from the rest.

Importing the required libraries:

In [1]:
#imports
import sys
import time
import warnings

import requests
import numpy as np
import pandas as pd
import regex as re
import matplotlib.pylab as plt
import seaborn as sns

import nltk
#nltk.download('vader_lexicon')
from nltk.corpus import stopwords # Import the stopword list
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

warnings.filterwarnings('ignore')

# Add the assets folder to the module search path so the custom helper can be imported
# Inspiration: https://stackoverflow.com/questions/4383571/importing-files-from-different-folder
sys.path.insert(1, '../assets')

from get_reddit_posts import get_reddit_posts

import pickle
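
A relative entry such as `'../assets'` is resolved against the kernel's working directory, so the import above only works when the notebook is run from its own folder. The following is a minimal sketch of a more defensive variant; the helper name `add_to_sys_path` is an assumption for illustration and is not part of the original notebook.

```python
import sys
from pathlib import Path

def add_to_sys_path(directory):
    """Insert the absolute form of directory into sys.path once and return it."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        # Position 1 mirrors the sys.path.insert(1, ...) call used in the notebook
        sys.path.insert(1, path)
    return path

resolved = add_to_sys_path(".")
print(resolved in sys.path)  # True
```

Calling the helper a second time with the same folder leaves `sys.path` unchanged.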

Calling the helper function to pull posts from the climatechange subreddit:

In [2]:

#Defining the initial API pull parameters:

par = {"subreddit": "climatechange", # The subreddit name
       "post_num": 100, # Number of posts to pull
       "time_1": int(time.mktime(time.strptime('1 July, 2020', '%d %B, %Y'))), # The latest pull time (epoch seconds; mktime interprets the date in local time)
       "API_limit": 100, # Maximum number of posts per API request
       "API_wait": 1 # Wait time (seconds) before the next pull
      }



df_test_reddit = get_reddit_posts(par["subreddit"], par["post_num"], par["time_1"], par["API_limit"], par["API_wait"])

100 posts downloaded, oldest post:2020-06-20 10:40:38 - status code: 200, now waiting 1 seconds before next pull. Patience...
Does the imported dataframe match the request? True
Final DataFrame shape: (100, 71), there are 0 duplicates
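
One detail about the `time_1` parameter: `time.mktime` interprets the parsed date in the machine's local time zone, so the cutoff can differ from one machine to another. A minimal sketch of a UTC-based alternative follows; the helper name `utc_epoch` is hypothetical and not part of the original notebook, and whether `get_reddit_posts` expects a UTC cutoff is an assumption.

```python
import calendar
import time

def utc_epoch(date_string, fmt='%d %B, %Y'):
    """Parse a date string and return epoch seconds, treating the date as UTC."""
    # calendar.timegm is the UTC counterpart of time.mktime
    return calendar.timegm(time.strptime(date_string, fmt))

cutoff = utc_epoch('1 July, 2020')
print(cutoff)  # 1593561600, i.e. 2020-07-01 00:00:00 UTC
```

Using a UTC cutoff makes the pull reproducible regardless of where the notebook is run.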


### Cleaning the new dataset

In [3]:
to_keep_clmns = ['author', 'created_utc', 'domain', 'id', 'num_comments', 'over_18',
       'post_hint', 'score', 'selftext',
       'subreddit_subscribers', 'title']

df_test_reddit = df_test_reddit[to_keep_clmns]

df_keep_reddit = df_test_reddit.copy()

df_test_reddit.head(10)

| | author | created_utc | domain | id | num_comments | over_18 | post_hint | score | selftext | subreddit_subscribers | title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Newman1651 | 1593569625 | en.mercopress.com | hj1f11 | 1 | False | | 1 | | 25629 | Sea ice in the Weddell Sea has decreased by on... |
| 1 | CharlieBrown829 | 1593569315 | self.climatechange | hj1c7i | 60 | False | | 1 | | 25629 | Are greenhouse gas emissions still going down ... |
| 2 | LackmustestTester | 1593555082 | nature.com | hixhvn | 2 | False | link | 1 | | 25621 | Floodplain inundation spectrum across the Unit... |
| 3 | iamchitranjanbaghi | 1593550152 | self.climatechange | hivz86 | 20 | False | | 1 | co2 becomes liquid under pressure and mariana ... | 25619 | putting co2 in deep mariana trench. |
| 4 | coffeecream22 | 1593549986 | self.climatechange | hivxd7 | 32 | False | self | 1 | He [posted a long letter](https://environmenta... | 25619 | Does anyone have science-based responses to Mi... |
| 5 | Gorgulak | 1593542301 | nature.com | hitako | 1 | False | link | 2 | | 25617 | New multi-method ensemble approach to reconstr... |
| 6 | ChargersPalkia | 1593537578 | self.climatechange | hirohk | 5 | False | self | 1 | https://www.npr.org/2019/02/05/691734652/the-n... | 25612 | Is this legit? Someone sent me this link about... |
| 7 | TheMineInventer | 1593533686 | self.climatechange | hiqdy2 | 0 | False | | 3 | [removed] | 25610 | The main mod of this server is also the mod of... |
| 8 | AndeanBear101 | 1593531549 | self.climatechange | hippaf | 0 | False | | 1 | [removed]\n\n[View Poll](https://www.reddit.co... | 25609 | Climate footprint |
| 9 | AndeanBear101 | 1593531504 | self.climatechange | hipoqc | 0 | False | | 1 | [removed] | 25610 | Cutting carbon footprint in home |


In [4]:
#For the title and selftext columns, I filled missing values with " " (stripped later) so the two can be merged.
#Assigning the result back avoids relying on inplace=True on a column of a sliced frame.

df_test_reddit["title"] = df_test_reddit["title"].fillna(" ")
df_test_reddit["selftext"] = df_test_reddit["selftext"].fillna(" ")

#Merging the title and selftext for further processing

df_test_reddit['text_merged'] = df_test_reddit['title'] + " " + df_test_reddit['selftext']
df_test_reddit = df_test_reddit.drop(columns = ["title", "selftext"])

#For subscribers, I imputed missing values with the median
df_test_reddit['subreddit_subscribers'] = df_test_reddit['subreddit_subscribers'].fillna(df_test_reddit['subreddit_subscribers'].median())

#For post_hint, I imputed missing values with "Empty"
df_test_reddit['post_hint'] = df_test_reddit['post_hint'].fillna("Empty")

In [5]:
#Cleaning the merged text


#Replacing "\n" characters with spaces
df_test_reddit['text_merged'] = df_test_reddit['text_merged'].apply(lambda x: re.sub('\n', ' ', x))

#Removing the "[removed]" placeholder
#Note: Series.replace only matches whole cell values, so a "[removed]" embedded in the merged
#text survives this step and shows up stemmed as "remov" in the outputs further down
df_test_reddit['text_merged'] = df_test_reddit['text_merged'].replace('[removed]', ' ')


#Replacing every non-letter character with a space
df_test_reddit['text_merged'] = df_test_reddit['text_merged'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))

#Lowercasing
df_test_reddit['text_merged'] = df_test_reddit['text_merged'].apply(lambda x: x.lower())



#Collapsing multiple spaces into one
#Source: https://pythonexamples.org/python-replace-multiple-spaces-with-single-space-in-text-file/
df_test_reddit['text_merged'] = df_test_reddit['text_merged'].apply(lambda x: ' '.join(x.split()))


#Removing stop words and stemming

# Build the stop-word set and the stemmer once instead of on every call
stops = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def remove_stops_stem(item):

    # Drop English stop words
    words = [w for w in item.split() if w not in stops]

    #lemmatizer = WordNetLemmatizer()
    #words = [lemmatizer.lemmatize(i) for i in words]

    # Stem each remaining word with the Porter stemmer
    words = [p_stemmer.stem(i) for i in words]

    # Join the stems back into a single space-separated string
    return " ".join(words)

df_test_reddit['text_merged'] = df_test_reddit['text_merged'].apply(remove_stops_stem)



df_test_reddit.reset_index(drop=True, inplace=True)
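
The character-level steps in the cell above can be checked on a single string without pandas. The following is a minimal sketch, not the notebook's code: `clean_text` is a hypothetical helper, it strips `[removed]` as a substring (whereas `Series.replace` only matches whole cell values), and it leaves out the NLTK stop-word and stemming steps.

```python
import re

def clean_text(text):
    """Apply the newline, placeholder, non-letter, lowercase and spacing steps."""
    text = text.replace("\n", " ")           # newlines become spaces
    text = text.replace("[removed]", " ")    # substring removal of the placeholder
    text = re.sub(r"[^a-zA-Z]", " ", text)   # keep letters only
    text = text.lower()
    return " ".join(text.split())            # collapse repeated spaces

sample = "Climate footprint [removed]\n\n[View Poll](https://www.reddit.co"
print(clean_text(sample))  # climate footprint view poll https www reddit co
```

With substring removal the placeholder no longer reaches the stemmer, so a token like `remov` would not need to be handled through the stop-word list.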

### Engineering the features

In [6]:
#Counting the characters and words in "text_merged"
df_test_reddit["text_char_count"] = df_test_reddit["text_merged"].map(lambda x: len(x))
df_test_reddit["text_word_count"] = df_test_reddit["text_merged"].map(lambda x: len(x.split(" ")))

#Sentiment analyzer
sent = SentimentIntensityAnalyzer()

df_test_reddit['sentiment_score'] = df_test_reddit["text_merged"].apply(lambda x: sent.polarity_scores(x)['compound'])


df_test_reddit['date'] = pd.to_datetime(df_test_reddit['created_utc'],unit='s')
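
A note on the word count: `len(x.split(" "))` returns 1 for an empty string, because `"".split(" ")` yields `['']`. If a post ends up empty after cleaning (for example, a title made only of digits), `text_word_count` would be 1 rather than 0. A short illustration in plain Python, where `word_count` is a hypothetical helper name:

```python
def word_count(text):
    """Count whitespace-separated tokens; an empty string has zero words."""
    return len(text.split())

print(len("".split(" ")))                # 1
print(word_count(""))                    # 0
print(word_count("sea ice weddel sea"))  # 4
```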



In [7]:
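def get_uwa(item):
    """Sum of the gaps between consecutive positions of top words in a post."""

    words = item.split()

    # top_words is loaded from the pickle in the next cell; build the lookup set once per call
    top = set(top_words)

    # Positions of the words that belong to the top-word list
    index_list = [i for i, w in enumerate(words) if w in top]

    # Sum of consecutive gaps (this telescopes to last position minus first position)
    dist = np.sum(np.array(index_list[1:]) - np.array(index_list[0:-1]))

    return dist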

In [8]:
top_words = pickle.load(open('../datasets/top_words_overlap.pkl', 'rb'))

In [9]:
df_test_reddit['wua'] = df_test_reddit['text_merged'].apply(get_uwa)
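
Because `get_uwa` sums the differences between consecutive positions, the total telescopes to the last matching position minus the first. A pure-Python sketch of the same quantity follows; the function name `top_word_span` and the toy top-word set are illustrative assumptions, not the contents of `top_words_overlap.pkl`.

```python
def top_word_span(text, top_words):
    """Sum of gaps between consecutive top-word positions in text."""
    positions = [i for i, w in enumerate(text.split()) if w in top_words]
    # Pair each position with the next one and add up the gaps
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps)

toy_top_words = {"sea", "ice"}  # hypothetical top words
span = top_word_span("sea ice weddel sea decreas", toy_top_words)
print(span)  # 3, the same as last position (3) minus first position (0)
```

A post with zero or one top word gets a value of 0, so the feature cannot distinguish between those two cases.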

### Loading the pickled trained model and testing it on the new dataset

In [10]:
lr = pickle.load(open('../datasets/lr.pkl', 'rb'))
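
`pickle.load(open(...))` leaves the file handle open until it is garbage-collected. Below is a small sketch of a loader that closes the file explicitly; `load_pickle` is a hypothetical helper, and the round trip uses a temporary file rather than the notebook's `lr.pkl`.

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Load a pickled object and close the file afterwards."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round trip on a temporary file to show the helper in use
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.pkl')
    with open(demo_path, 'wb') as f:
        pickle.dump({"top": ["sea", "ice"]}, f)
    restored = load_pickle(demo_path)

print(restored)  # {'top': ['sea', 'ice']}
```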

In [11]:
X = df_test_reddit['text_merged']

In [12]:
vectorizer = pickle.load(open('../datasets/vectorizer.pkl', 'rb'))
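# Custom stop words used when the vectorizer was fit (kept for reference with the commented-out code below)
list_stop_words = ["dec", "global", "http", "www", "com", "conspiraci", "warm", "climat", "remov", "theori", "theactualshadow", "co"]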



# vectorizer = CountVectorizer(analyzer = "word",
#                              #ngram_range=(1,2),
#                              tokenizer = None,
#                              preprocessor = None,
#                              stop_words = list_stop_words,
#                              max_features = 3000) 


#X = vectorizer.fit_transform(X_train)

#pickle.dump(vectorizer, open('../datasets/vectorizer.pkl', 'wb'))

# Transform only: reuse the vocabulary learned during training (no refitting on the new posts)
X_features = vectorizer.transform(X)


In [13]:
X_features.shape

(100, 3000)
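
`X_features` has 3000 columns because `transform` reuses the vocabulary fitted during training: words outside that vocabulary are ignored rather than added as new columns. The sketch below mimics that behaviour in plain Python on a toy vocabulary. It is a simplified stand-in for illustration, not scikit-learn's implementation, and `toy_vocab` and `bow_transform` are hypothetical names.

```python
def bow_transform(docs, vocab):
    """Count occurrences of each vocabulary word per document (fixed vocabulary)."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            col = vocab.get(word)
            if col is not None:  # words outside the vocabulary are dropped
                row[col] += 1
        rows.append(row)
    return rows

toy_vocab = {"sea": 0, "ice": 1, "emiss": 2}  # word -> column index
docs = ["sea ice weddel sea", "ghg emiss realli reduc"]
features = bow_transform(docs, toy_vocab)
print(features)  # [[2, 1, 0], [0, 0, 1]]
```

The output always has one row per document and one column per vocabulary word, which is why the new posts can be fed directly to the trained model.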

In [14]:
pred = lr.predict(X_features)

In [15]:
df_test_reddit.head(5)

| | author | created_utc | domain | id | num_comments | over_18 | post_hint | score | subreddit_subscribers | text_merged | text_char_count | text_word_count | sentiment_score | date | wua |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Newman1651 | 1593569625 | en.mercopress.com | hj1f11 | 1 | False | Empty | 1 | 25629 | sea ice weddel sea decreas one million sq kilo... | 59 | 11 | 0.0 | 2020-07-01 02:13:45 | 5.0 |
| 1 | CharlieBrown829 | 1593569315 | self.climatechange | hj1c7i | 60 | False | Empty | 1 | 25629 | greenhous ga emiss still go global due pandem | 45 | 8 | 0.0 | 2020-07-01 02:08:35 | 0.0 |
| 2 | LackmustestTester | 1593555082 | nature.com | hixhvn | 2 | False | link | 1 | 25621 | floodplain inund spectrum across unit state | 43 | 6 | 0.0 | 2020-06-30 22:11:22 | 0.0 |
| 3 | iamchitranjanbaghi | 1593550152 | self.climatechange | hivz86 | 20 | False | Empty | 1 | 25619 | put co deep mariana trench co becom liquid pre... | 234 | 41 | -0.7506 | 2020-06-30 20:49:12 | 10.0 |
| 4 | coffeecream22 | 1593549986 | self.climatechange | hivxd7 | 32 | False | self | 1 | 25619 | anyon scienc base respons michael shellenberg ... | 259 | 36 | -0.743 | 2020-06-30 20:46:26 | 20.0 |


In [16]:
pred[18]

1

In [17]:
df_test_reddit["preds"] = pred

#### Posts classified as non-global warming:

In [18]:
indices = df_test_reddit[df_test_reddit["preds"]==0].index
indices

Int64Index([7, 39, 43, 47, 55, 91], dtype='int64')

In [19]:
df_test_reddit.loc[indices,:]["text_merged"]

7         main mod server also mod r climateskept remov
39    facebook creat fact check exempt climat denier...
43    jeff bezo say amazon bought name right climat ...
47    lesson pandem hi would like know lesson take c...
55    look volunt design skill part non profit proje...
91    climat movement must unit behind black live ma...
Name: text_merged, dtype: object

In [20]:
df_test_reddit.loc[0,:]["text_merged"]

'sea ice weddel sea decreas one million sq kilomet five year'

In [21]:
df_keep_reddit.loc[47,:]["title"]

'Lessons from the pandemic'

#### Posts classified as global warming:

In [22]:
indices = df_test_reddit[df_test_reddit["preds"]==1].index
indices

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
            35, 36, 37, 38, 40, 41, 42, 44, 45, 46, 48, 49, 50, 51, 52, 53, 54,
            56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72,
            73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,
            90, 92, 93, 94, 95, 96, 97, 98, 99],
           dtype='int64')

In [23]:
df_test_reddit.loc[indices,:]["text_merged"]

0     sea ice weddel sea decreas one million sq kilo...
1         greenhous ga emiss still go global due pandem
2           floodplain inund spectrum across unit state
3     put co deep mariana trench co becom liquid pre...
4     anyon scienc base respons michael shellenberg ...
                            ...                        
95                     ghg emiss realli reduc due covid
96    eco friendli altern cement might save world su...
97                power work fossil fuel industri remov
98                             extinct rebellion realli
99                         climat sceptic mod sub remov
Name: text_merged, Length: 94, dtype: object
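
The two index listings can be summarized as class counts. The sketch below rebuilds the prediction vector from the six indices reported as class 0 earlier in this section (every other index is class 1) and tallies each class; in the notebook itself, `df_test_reddit["preds"].value_counts()` would give the same tally directly.

```python
from collections import Counter

non_gw_indices = [7, 39, 43, 47, 55, 91]  # copied from the class-0 index output
preds = [0 if i in non_gw_indices else 1 for i in range(100)]

counts = Counter(preds)
share_gw = counts[1] / len(preds)
print(counts[0], counts[1], share_gw)  # 6 94 0.94
```

So 94 of the 100 posts from the climatechange subreddit were labeled as global warming, and only 6 were filtered out.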

In [24]:
df_test_reddit.loc[97,:]["text_merged"]

'power work fossil fuel industri remov'

