# Project Pt. I

## Analyze & modify the dataset to suit our training needs

In [1]:
import pandas as pd

In [2]:
# Load the JSON data into a dataframe
df = pd.read_json("NLPCSS-20-main/data/data.json")
df.head()

Unnamed: 0,source,title,event_id,adfontes_fair,adfontes_political,allsides_bias,content,misc
0,Fox News,"Trump blasts Howard Schultz, says ex-Starbucks...",0,bias,bias,From the Right,Obama administration alum Roger Fisk and Repub...,"{'time': '2019-01-28 16:10:44.680484', 'topics..."
1,USA TODAY,Trump blasts former Starbucks CEO Howard Schul...,0,bias,neutral,From the Center,WASHINGTON – President Donald Trump took a swi...,"{'time': 'None', 'topics': 'Election: Presiden..."
2,Washington Times,Mick Mulvaney: Trump to secure border 'with or...,0,bias,neutral,From the Right,Acting White House chief of staff Mick Mulvane...,"{'time': 'None', 'topics': 'White House', 'aut..."
3,Washington Times,Trump says 'we'll do the emergency' if border ...,0,bias,neutral,From the Right,President Trump repeated his vow Friday to dec...,"{'time': 'None', 'topics': 'White House, Polit..."
4,BBC News,Trump backs down to end painful shutdown tempo...,0,bias,neutral,From the Center,President Donald Trump has yielded to politica...,"{'time': '2019-01-26 00:00:00', 'topics': 'Whi..."


Explore the dataframe with pandas and trim it down to the columns we need

In [3]:
# Summarize the columns, dtypes and non-null counts
print(df.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7775 entries, 0 to 7774
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   source              7775 non-null   object
 1   title               7775 non-null   object
 2   event_id            7775 non-null   int64 
 3   adfontes_fair       7775 non-null   object
 4   adfontes_political  7775 non-null   object
 5   allsides_bias       7775 non-null   object
 6   content             7775 non-null   object
 7   misc                7775 non-null   object
dtypes: int64(1), object(7)
memory usage: 486.1+ KB
None


We will train our model using only the content and allsides_bias columns, so we can drop the other columns in the dataset. We will also check for null entries and remove any that we find.

In [4]:
#drop all columns except content and allsides bias
df.drop(columns=['source','title','event_id','misc','adfontes_fair','adfontes_political'], inplace=True)
df.head()

Unnamed: 0,allsides_bias,content
0,From the Right,Obama administration alum Roger Fisk and Repub...
1,From the Center,WASHINGTON – President Donald Trump took a swi...
2,From the Right,Acting White House chief of staff Mick Mulvane...
3,From the Right,President Trump repeated his vow Friday to dec...
4,From the Center,President Donald Trump has yielded to politica...


In [5]:
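# Check whether there are any null entries
null_entries = df.isnull().sum()
# A bare expression in the middle of a cell is not displayed; use print(null_entries) to see the counts
# There are no null entries, so the helper variable can be deleted
del null_entries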
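
This dataset has no null entries, so nothing needs to be removed. Had there been any, the cleanup could look roughly like the sketch below. The toy frame and its values are made up for illustration and are not part of the project data.

```python
import pandas as pd

# Toy frame (hypothetical data) with one missing value in each column
toy = pd.DataFrame({
    "allsides_bias": ["From the Left", None, "From the Right"],
    "content": ["some text", "more text", None],
})

# Count missing values per column, as in the cell above
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that is missing either field, then renumber the index
clean = toy.dropna(subset=["allsides_bias", "content"]).reset_index(drop=True)
print(clean.shape)
```

Resetting the index keeps row labels contiguous, which avoids surprises when rows are later selected by position.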

In [6]:
#look at reformatted dataframe
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7775 entries, 0 to 7774
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   allsides_bias  7775 non-null   object
 1   content        7775 non-null   object
dtypes: object(2)
memory usage: 121.6+ KB
None


Analyze the reformatted dataframe with pandas

In [7]:
print("The shape of the dataframe is:", df.shape, "\n")
print("The unique values of allsides_bias are:\n", df.allsides_bias.unique(), "\n")
print("The relative frequency of each value is:")
df.allsides_bias.value_counts(normalize=True)

The shape of the dataframe is: (7775, 2) 

The unique values of allsides_bias are:
 ['From the Right' 'From the Center' 'From the Left'] 

The relative frequency of each value is:


From the Left      0.473826
From the Right     0.366688
From the Center    0.159486
Name: allsides_bias, dtype: float64
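
The classes are clearly imbalanced, with "From the Center" making up only about 16% of the rows. One way to keep these proportions in both halves of a train/test split is to sample within each class. The sketch below is an assumption, not the notebook's actual splitting code: `stratified_split`, the 20% test fraction, and the toy frame are all hypothetical, and `groupby(...).sample` needs pandas 1.1 or later.

```python
import pandas as pd

def stratified_split(frame, label_col, test_frac=0.2, seed=0):
    """Hold out `test_frac` of the rows of each class as a test set."""
    # Sample the same fraction inside every class so proportions are preserved
    test = frame.groupby(label_col).sample(frac=test_frac, random_state=seed)
    # Everything that was not sampled becomes the training set
    train = frame.drop(test.index)
    return train, test

# Toy frame (hypothetical data) with an imbalance similar to the one above
toy = pd.DataFrame({
    "allsides_bias": ["From the Left"] * 50
                     + ["From the Right"] * 40
                     + ["From the Center"] * 10,
    "content": [f"article {i}" for i in range(100)],
})

train, test = stratified_split(toy, "allsides_bias")
print(test["allsides_bias"].value_counts())
```

The approach relies on the index being unique, which holds for the default RangeIndex used here.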

## Tokenize the dataframe and create splits

Create tokens from the content column

In [8]:
# import libraries to tokenize df
import nltk
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/msalvador45/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
# Tokenize dataframe
df['tokens_raw']= df['content'].apply(word_tokenize)
df.head()

Unnamed: 0,allsides_bias,content,tokens_raw
0,From the Right,Obama administration alum Roger Fisk and Repub...,"[Obama, administration, alum, Roger, Fisk, and..."
1,From the Center,WASHINGTON – President Donald Trump took a swi...,"[WASHINGTON, –, President, Donald, Trump, took..."
2,From the Right,Acting White House chief of staff Mick Mulvane...,"[Acting, White, House, chief, of, staff, Mick,..."
3,From the Right,President Trump repeated his vow Friday to dec...,"[President, Trump, repeated, his, vow, Friday,..."
4,From the Center,President Donald Trump has yielded to politica...,"[President, Donald, Trump, has, yielded, to, p..."


Remove stopwords, a few punctuation marks, hashtags, URLs and mentions

In [10]:
# Requires the NLTK stopwords corpus (nltk.download('stopwords'))
stops = set(stopwords.words('english'))
chars2remove = {'.', '!', '/', '?'}
# Tokens starting with '#', 'http' or '@' (hashtags, URLs, mentions)
noise = re.compile(r'^(#|http|@)')
# Note: the stopword list is lowercase and the tokens are not lowercased,
# so capitalized stopwords such as "The" are kept
df['tokens_raw'] = df['tokens_raw'].apply(
    lambda x: [w for w in x if w not in stops and w not in chars2remove and not noise.match(w)]
)
df.head()

Unnamed: 0,allsides_bias,content,tokens_raw
0,From the Right,Obama administration alum Roger Fisk and Repub...,"[Obama, administration, alum, Roger, Fisk, Rep..."
1,From the Center,WASHINGTON – President Donald Trump took a swi...,"[WASHINGTON, –, President, Donald, Trump, took..."
2,From the Right,Acting White House chief of staff Mick Mulvane...,"[Acting, White, House, chief, staff, Mick, Mul..."
3,From the Right,President Trump repeated his vow Friday to dec...,"[President, Trump, repeated, vow, Friday, decl..."
4,From the Center,President Donald Trump has yielded to politica...,"[President, Donald, Trump, yielded, political,..."
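
Because NLTK's stopword list is lowercase, the filter above only removes tokens that are already lowercase. A possible variant that compares case-insensitively is sketched below. The `clean_tokens` helper and the short `STOPS` set are stand-ins invented for this example, not the notebook's code or NLTK's full list.

```python
import re

# Small stand-in stopword list (hypothetical); the notebook uses NLTK's English list
STOPS = {"the", "a", "and", "of", "to", "his"}
PUNCT = {".", "!", "/", "?"}
# Matches tokens that start with '#', '@' or 'http'
NOISE = re.compile(r"^(#|@|http)")

def clean_tokens(tokens):
    """Drop stopwords (ignoring case), listed punctuation, hashtags, mentions and URLs."""
    return [
        t for t in tokens
        if t.lower() not in STOPS
        and t not in PUNCT
        and not NOISE.match(t)
    ]

print(clean_tokens(["The", "President", "tweeted", "@user", "#news", "http://x.co", "."]))
# → ['President', 'tweeted']
```

The original casing of the kept tokens is preserved; only the stopword comparison is done in lowercase.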


Next, we lemmatize the tokens

In [20]:
# We use spaCy to lemmatize because it uses a more sophisticated model
import spacy
# Load the small English pipeline; the parser and NER are not needed for lemmatization
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

In [22]:
# Try the pipeline on a sample sentence
doc = nlp("we threw a party, we threw a party, bitches came over yeah we threw a party")
" ".join([token.lemma_ for token in doc])

'we throw a party , we throw a party , bitch come over yeah we throw a party'
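
To lemmatize the whole dataframe rather than a single sentence, the texts can be streamed through the pipeline with spaCy's `pipe` method, which processes documents in batches. The helper below is a sketch under assumptions: the name `lemmatize_texts`, the batch size, and the `lemmas` column in the usage comment are hypothetical and do not come from the notebook.

```python
def lemmatize_texts(pipeline, texts, batch_size=64):
    """Return one list of lemmas per input text.

    `pipeline` is assumed to be a loaded spaCy pipeline, for example the
    object returned by spacy.load("en_core_web_sm", disable=['parser', 'ner']).
    """
    return [
        [token.lemma_ for token in doc]
        for doc in pipeline.pipe(texts, batch_size=batch_size)
    ]

# Possible usage in the notebook, with the pipeline loaded above:
# df['lemmas'] = lemmatize_texts(nlp, df['content'])
```

Note that this runs spaCy's own tokenizer on the raw content, so the resulting tokens will not line up one-to-one with the NLTK tokens in tokens_raw.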