 Your task is to predict whether a news article is real or fake using the available information.

The dataset that you'll use can be downloaded from and is described at https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset as well as the following references:

You will probably get the most useful signal from the content of the articles and from their titles.

**IMPORTANT NOTE**: You _must_ remove the news agency names from the articles.  For example, if an article is from Reuters, you should remove the word "Reuters" from the article.  This is because the news agency name is a very strong indicator of whether an article is real or fake, and we want you to focus on the content of the article itself.  You can use code similar to the following to remove the news agency names:

```python
import re
def remove_news_agency_name(text):
    return re.sub(r"Reuters|AP|New York Times|Washington Post|Business Insider|Atlantic|Fox News|National Review|Talking Points Memo|Buzzfeed News|Guardian|NPR|Vox|CNN|BBC|Bloomberg|Daily Mail", "", text)
```
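
One thing to watch with a pattern like the one above: it is case-sensitive and has no word boundaries, so it misses text that has already been lowercased and it removes "AP" from inside words such as "APPLE". The following is a minimal sketch of a stricter variant, assuming a shortened agency list; the names `AGENCIES`, `AGENCY_PATTERN`, and `remove_news_agency_name_strict` are hypothetical and not part of the assignment.

```python
import re

# Shortened, illustrative list; extend it with the full set of agency names.
AGENCIES = ["Reuters", "AP", "New York Times", "CNN", "BBC"]

# \b keeps "AP" from matching inside "APPLE"; IGNORECASE also catches
# text that has already been lowercased.
AGENCY_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in AGENCIES) + r")\b",
    flags=re.IGNORECASE,
)

def remove_news_agency_name_strict(text):
    """Remove whole-word agency names, then collapse leftover whitespace."""
    cleaned = AGENCY_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()

example = remove_news_agency_name_strict("BRUSSELS (reuters) - APPLE said on Tuesday")
print(example)
```

Whether case-insensitive matching is appropriate depends on where in the pipeline the function runs; if it runs before lowercasing, the original case-sensitive pattern with added word boundaries may be enough.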

You have several techniques at your disposal for creating features from text, including word embeddings, part-of-speech analysis (from SI 330), and so on.  You might want to use CountVectorizer and/or TfidfVectorizer from the sklearn.feature_extraction.text module, which are described below.

1. You should pre-process your text using at least some of the steps outlined in lectures (e.g. normalizing to lowercase, splitting into words, etc.).
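
As a rough illustration of the pre-processing steps named above, here is a minimal sketch using only the standard library. The stopword set is a tiny made-up stand-in and `preprocess` is a hypothetical helper; a real run would likely use a fuller stopword list such as the one shipped with NLTK.

```python
import re

# Tiny illustrative stopword list; in practice use a fuller one,
# e.g. nltk.corpus.stopwords.words("english").
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "is"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess("The Senate voted on the bill, and the President signed it.")
print(tokens)
```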

The articles are provided in two different files: Fake.csv and True.csv.  We recommend that you create a dataframe with the contents of those files combined, including a new column that specifies whether the article is real or fake (note that you can use whatever coding you want for "real" vs. "fake", e.g. 1 and 0, "real" and "fake", "false" and "true" -- whatever works for you).

2. You should split the resulting combined dataframe into training and testing datasets OR use cross-validation.  If you go the splitting-into-training-and-testing route, we recommend an 80-20 split (i.e. training gets 80% of the data; testing gets 20%) and using the testing dataset to report your accuracy score.  If you go the cross-validation route, we recommend using 5-fold cross-validation and reporting the mean accuracy score across your 5 folds.
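
The cross-validation route can be sketched roughly as follows. This is an illustration on a made-up toy corpus, not the assignment's data, and the choice of TfidfVectorizer with LogisticRegression is an assumption made only to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (made up for illustration): 1 = real, 0 = fake.
real_docs = [f"officials said the committee approved budget item {i}" for i in range(10)]
fake_docs = [f"shocking secret video exposes scandal number {i}" for i in range(10)]
texts = real_docs + fake_docs
labels = [1] * len(real_docs) + [0] * len(fake_docs)

# Keeping the vectorizer inside the pipeline means it is re-fit on the
# training folds only, so no vocabulary statistics leak from held-out data.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# For a classifier and an integer cv, scikit-learn uses stratified folds.
scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")
print(scores.mean())
```

The same pipeline object can be passed to `train_test_split`-based evaluation instead; the point of the design is that vectorizer fitting and model fitting always see the same training rows.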


Much like the previous homework assignment, you'll want to try a variety of classifiers and possibly use an ensemble.  And, in a similar way to the previous homework assignment, your submission (to Canvas -- there is no requirement to submit this anywhere else, including Kaggle) should be based on a Jupyter notebook that you create.

3. As a final challenge, we would like you to attempt to characterize each of the datasets in terms of their semantic content.  This might involve extracting the most commonly occurring words (possibly limiting that to specific parts of speech), examining the Named Entities, and extracting keywords by leveraging word embeddings.  Use your imagination, and remember there is no single "correct" answer.  For those of you looking to teach yourself something new, check out Latent Dirichlet Allocation (LDA) using the `gensim` library.  To get started with LDA, check out https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ and https://radimrehurek.com/gensim/models/ldamodel.html.  You are not required to use LDA, but it is a powerful technique for extracting topics from text.
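
One simple way to characterize each dataset separately is to count words per class and compare the top lists. The sketch below uses made-up token lists in place of the tokenized articles, and `most_frequent` is a hypothetical helper name.

```python
from collections import Counter

# Made-up token lists standing in for the tokenized articles of each class.
fake_tokens = [["trump", "video", "shocking"], ["video", "watch", "trump"]]
real_tokens = [["senate", "bill", "vote"], ["senate", "tax", "bill"]]

def most_frequent(token_lists, n=2):
    """Count tokens across all documents and return the n most frequent."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return counts.most_common(n)

fake_top = most_frequent(fake_tokens)
real_top = most_frequent(real_tokens)

# Words that appear in one class's top list but not in the other's.
fake_only = {w for w, _ in fake_top} - {w for w, _ in real_top}
print(fake_top, real_top, fake_only)
```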


In [26]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re

import nltk
from nltk import Text
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import word_tokenize  
from nltk.tokenize import sent_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


In [34]:
!pip install --upgrade nltk




In [2]:
! python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
pip install -U spacy

Note: you may need to restart the kernel to use updated packages.


In [2]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [3]:
#true=1
#fake=0

fake=pd.read_csv('Fake.csv')
true=pd.read_csv('True.csv')

fake['real_vs_fake']=0
true['real_vs_fake']=1

In [4]:
df=pd.concat([fake,true],axis=0)
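
Note that `pd.concat` keeps each frame's original row labels by default, so the combined frame has repeated index values. The small sketch below, using made-up two-row frames, shows the difference that `ignore_index=True` makes; whether it matters depends on whether later steps align rows by label or by position.

```python
import pandas as pd

# Tiny made-up frames standing in for Fake.csv and True.csv.
fake_demo = pd.DataFrame({"title": ["a", "b"], "real_vs_fake": 0})
true_demo = pd.DataFrame({"title": ["c", "d"], "real_vs_fake": 1})

kept = pd.concat([fake_demo, true_demo], axis=0)
reset = pd.concat([fake_demo, true_demo], axis=0, ignore_index=True)

print(kept.index.tolist())   # labels repeat across the two sources
print(reset.index.tolist())  # one unique label per row
```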

In [13]:
df

Unnamed: 0,title,text,subject,date,real_vs_fake
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
...,...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",1
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",1
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",1
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",1


In [19]:
#preprocessing

#normalize to lowercase
def lower_case(text):
    return text.lower()


df['text']=df['text'].apply(lower_case)
df['title']=df['title'].apply(lower_case)

In [31]:
import nltk
nltk.download('stopwords')
nltk.data.path.append("/path/to/nltk_data")
print(nltk.__version__)

3.8.1


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sarrahahmed/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [38]:
#remove punctuation, news agency names, and stopwords

#spaCy took over 8 minutes here; tried nltk instead, but went back to spaCy because nltk was not working
import re
from nltk.corpus import stopwords

#named stop_words so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

#the text was lowercased in an earlier cell, so match case-insensitively;
#\b keeps short names like "AP" from matching inside other words
agency_pattern = re.compile(r"\b(?:Reuters|AP|New York Times|Washington Post|Business Insider|Atlantic|Fox News|National Review|Talking Points Memo|Buzzfeed News|Guardian|NPR|Vox|CNN|BBC|Bloomberg|Daily Mail)\b", flags=re.IGNORECASE)

def remove_news_agency_name(text):
    return agency_pattern.sub("", text)

def remove_stopwords(text):
    #tokenize with spaCy, keeping alphabetic tokens that are not stopwords
    doc = nlp(text)
    tokens = [token.text for token in doc if token.is_alpha and token.lower_ not in stop_words]
    return tokens


df['text'] = df['text'].apply(remove_punctuation)
df['title'] = df['title'].apply(remove_punctuation)
df['text'] = df['text'].apply(remove_news_agency_name)
df['title'] = df['title'].apply(remove_news_agency_name)
df['tokenized_text'] = df['text'].apply(remove_stopwords)
df['tokenized_title'] = df['title'].apply(remove_stopwords)





In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word',max_df=0.74,max_features=500)
df['combined_text']=df['tokenized_text'].apply(lambda x: ' '.join(x))
txt_fitted=tf.fit(df['combined_text'])
txt_transformed=txt_fitted.transform(df['combined_text'])
feature_names=tf.get_feature_names_out()
tfidf_df=pd.DataFrame(txt_transformed.toarray(),columns=feature_names)

tfidf_df
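
With `smooth_idf=False`, `sublinear_tf=False`, and `norm=None`, scikit-learn's documented weighting reduces to the raw term count times `ln(N / df) + 1`, where `N` is the number of documents and `df` is the number of documents containing the term. A minimal pure-Python sketch of that formula on a made-up three-document corpus (the helper name `tfidf` is hypothetical):

```python
import math
from collections import Counter

docs = [["tax", "bill", "vote"], ["tax", "tax", "cut"], ["vote", "recount"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw count times (ln(N / df) + 1), with no normalization."""
    counts = Counter(doc)
    return {t: c * (math.log(n_docs / doc_freq[t]) + 1.0) for t, c in counts.items()}

weights = tfidf(docs[1])
print(weights)
```

Because there is no normalization, longer articles produce larger feature values, which is worth keeping in mind when feeding these features to scale-sensitive models.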

In [None]:
#putting a model together

from sklearn.model_selection import train_test_split    
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier,VotingClassifier

X_train, X_test, y_train, y_test = train_test_split(tfidf_df, df['real_vs_fake'], test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
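#max_iter raised from the default of 100 because the unnormalized TF-IDF features may need more iterations to converge
lr_clf = LogisticRegression(max_iter=1000, random_state=42)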
gb_clf = GradientBoostingClassifier(random_state=42)
ada_clf = AdaBoostClassifier(random_state=42)

voting_clf = VotingClassifier(estimators=[('rf', rf_model), ('lr', lr_clf), ('gb', gb_clf), ('ada', ada_clf)], voting='soft')
voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)

In [None]:
vote_accuracy = accuracy_score(y_test, y_pred)
print(f"Voting classifier accuracy: {vote_accuracy:.4f}")

In [None]:
#final challenge

from collections import Counter
import spacy
from gensim.models import Word2Vec

all_tokens=[]
for token_list in df['tokenized_text']:
    all_tokens+=token_list

word_counts=Counter(all_tokens)

top_words=word_counts.most_common(53846)
top_words

In [None]:
nlp=spacy.load('en_core_web_sm')
pos=[]
most_common_words=[word for word, count in top_words]
#nlp.pipe processes the words in batches, which is faster than calling nlp() on each word
for doc in nlp.pipe(most_common_words):
    for token in doc:
        if token.pos_ in ('NOUN', 'ADJ'):
            pos.append(token.text)

In [None]:
#named entities

text=' '.join(pos)
doc=nlp(text)
entities=[(ent.text,ent.label_) for ent in doc.ents]