# Data Collection

## Brief Introduction

This notebook collects data from CNN Transcripts by scraping the site with the BeautifulSoup library.
The main aim of this project is to discover how topics trended in CNN articles during 2020.

# *Questions:*
* What were the main topics during the year?


In [17]:
import requests
from bs4 import BeautifulSoup
import pickle
import pandas as pd
from datetime import datetime

# Scrapes transcript data
def url_to_transcript(url):
    '''Returns the list of transcript titles from a daily transcripts.cnn.com index page.'''
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    # Take the links inside the first element with class "cnnSectBulletItems"; their text holds the transcript titles.
    text = [p.text for p in soup.find(class_="cnnSectBulletItems").find_all('a')]
    print(url)
    return text

# Build one index-page URL per day, with the date formatted as YYYY.MM.DD
dates = pd.date_range(start="2020-01-01", end="2021-01-01")
dates = [date.strftime("%Y.%m.%d") for date in dates]
urls = ['http://transcripts.cnn.com/TRANSCRIPTS/' + date + '.html' for date in dates]
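
`url_to_transcript` raises an `AttributeError` when a page has no `cnnSectBulletItems` element, because `soup.find` returns `None` in that case. The following is a minimal sketch of a more defensive parsing step, not the notebook's own method. The helper name `parse_titles` and the sample HTML are hypothetical, and it uses the standard-library `html.parser` backend instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def parse_titles(html):
    """Return the link texts inside the first 'cnnSectBulletItems' block, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")  # assumption: stdlib parser instead of "lxml"
    block = soup.find(class_="cnnSectBulletItems")
    if block is None:
        # No transcript list on this page: return an empty list instead of failing
        return []
    return [a.get_text(strip=True) for a in block.find_all("a")]


# Hypothetical sample markup that imitates the structure the scraper expects
sample = (
    '<div class="cnnSectBulletItems">'
    '<a href="#">Headline one. Aired 8-9p ET</a>'
    '<a href="#">Headline two. Aired 9-10p ET</a>'
    '</div>'
)
titles = parse_titles(sample)
empty = parse_titles("<html><body><p>No transcripts today</p></body></html>")
print(titles)
print(empty)
```

When fetching real pages, such a parser could be paired with `requests.get(url, timeout=...)` and `response.raise_for_status()`, so that a slow or failing request does not silently produce an empty day.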


In [18]:
# transcripts = [url_to_transcript(u) for u in urls]

In [19]:
# # # Make a new directory to hold the text files
# !mkdir transcripts

# for i, c in enumerate(dates):
#      with open("transcripts/" + c + ".txt", "wb") as file:
#          pickle.dump(transcripts[i], file)

In [20]:
# Load pickled files
data = {}
for i, c in enumerate(dates):
    with open("transcripts/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
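
The save and load cells above form a pickle round trip, one file per date. The sketch below reproduces that round trip on toy data in a temporary folder. The toy headlines, the `.pkl` extension (the notebook stores its pickles under `.txt` names) and the skip-if-missing check are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical toy data standing in for the scraped headlines
toy = {
    "2020.01.01": ["Did Not Air 8-9p ET"],
    "2020.01.02": ["Headline A. Aired 8-9p ET", "Headline B. Aired 9-10p ET"],
}

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "transcripts"
    folder.mkdir()

    # Save one pickle per date; pickle files must be opened in binary mode
    for date, titles in toy.items():
        with open(folder / (date + ".pkl"), "wb") as file:
            pickle.dump(titles, file)

    # Load the files back, skipping any date whose file does not exist
    loaded = {}
    for date in list(toy) + ["2020.01.03"]:
        path = folder / (date + ".pkl")
        if not path.exists():
            continue
        with open(path, "rb") as file:
            loaded[date] = pickle.load(file)

print(loaded.keys())
```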

In [21]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['2020.01.01', '2020.01.02', '2020.01.03', '2020.01.04', '2020.01.05', '2020.01.06', '2020.01.07', '2020.01.08', '2020.01.09', '2020.01.10', '2020.01.11', '2020.01.12', '2020.01.13', '2020.01.14', '2020.01.15', '2020.01.16', '2020.01.17', '2020.01.18', '2020.01.19', '2020.01.20', '2020.01.21', '2020.01.22', '2020.01.23', '2020.01.24', '2020.01.25', '2020.01.26', '2020.01.27', '2020.01.28', '2020.01.29', '2020.01.30', '2020.01.31', '2020.02.01', '2020.02.02', '2020.02.03', '2020.02.04', '2020.02.05', '2020.02.06', '2020.02.07', '2020.02.08', '2020.02.09', '2020.02.10', '2020.02.11', '2020.02.12', '2020.02.13', '2020.02.14', '2020.02.15', '2020.02.16', '2020.02.17', '2020.02.18', '2020.02.19', '2020.02.20', '2020.02.21', '2020.02.22', '2020.02.23', '2020.02.24', '2020.02.25', '2020.02.26', '2020.02.27', '2020.02.28', '2020.02.29', '2020.03.01', '2020.03.02', '2020.03.03', '2020.03.04', '2020.03.05', '2020.03.06', '2020.03.07', '2020.03.08', '2020.03.09', '2020.03.10', '2020.03.

In [22]:
# More checks
data['2020.03.20']

["75 Million Americans Ordered to Stay at Home; U.S. Coronavirus Cases Top 18,000, At Least 236 Deaths; Member of VP Pence's Staff Tests Positive for Coronavirus. Aired on 8-9p ET",

In [23]:
# We are going to change this to key: date, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text
# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
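
To make the effect of `combine_text` concrete, here is a small sketch on hypothetical toy data. Each combined string is wrapped in a one-element list so that `pd.DataFrame.from_dict` later produces one column per date, which the transpose turns into one row per date.

```python
def combine_text(list_of_text):
    """Join a list of headline strings into one string, separated by single spaces."""
    return ' '.join(list_of_text)


# Hypothetical toy input: date -> list of headlines (an empty list for a day without data)
toy = {
    "2020.01.02": ["Headline A. Aired 8-9p ET", "Headline B. Aired 9-10p ET"],
    "2020.01.03": [],
}

# Same comprehension as in the notebook, applied to the toy data
combined = {date: [combine_text(titles)] for date, titles in toy.items()}
print(combined)
```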

In [24]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

date,transcript
2020.01.01,Did Not Air 8-9p ET
2020.01.02,Report: Unredacted E-Mails Tie Ukraine Aid Hold Directly to President Trump; Rep. Jackie Speier (D-CA) is Interviewed About New Documents Tying Uk...
2020.01.03,"President Trump: Iranian Commander Killed By U.S. Was Plotting ""Imminent and Sinister"" Attacks Against Americans; Reports: New Airstrike Targets I..."
2020.01.04,"Iran Vows ""Harsh Revenge"" for U.S. Killing of Top Commander; Australian Prime Minister Ramps Up Defense Response as Fires Intensify. Aired 12-1a E..."
2020.01.05,Did Not Air. 5-6p ET Did Not Air. 6-7p ET Iraqi Parliament to Hold Emergency Session; Iran Promises Revenge on U.S. for Assassination of General S...
...,...
2020.12.28,"House Passes Bill For Bigger COVID Relief Check Supported By Trump, Opposed By GOP Senate; Interview With Rep. Adam Kinzinger (R- IL); GOP Congres..."
2020.12.29,"Colorado Health Officials Detect First Known Case In U.S. Of New COVID Strain; McConnell Attempting To Inject ""Poison Pill"" Provisions; Interview ..."
2020.12.30,Faster Moving Virus and Slow Rolling Vaccine Rollout; Interview with Rep. Donna Shalala (D-FL); New COVID Strain Identified in Second State One Da...
2020.12.31,Did Not Air 8-9p ET


As the transcript data shows, some days have no transcript because the program did not air (the entry reads "Did Not Air 8-9p ET"). We can delete those rows.

In [25]:
data_df = data_df[data_df.transcript != "Did Not Air 8-9p ET"]
data_df

date,transcript
2020.01.02,Report: Unredacted E-Mails Tie Ukraine Aid Hold Directly to President Trump; Rep. Jackie Speier (D-CA) is Interviewed About New Documents Tying Uk...
2020.01.03,"President Trump: Iranian Commander Killed By U.S. Was Plotting ""Imminent and Sinister"" Attacks Against Americans; Reports: New Airstrike Targets I..."
2020.01.04,"Iran Vows ""Harsh Revenge"" for U.S. Killing of Top Commander; Australian Prime Minister Ramps Up Defense Response as Fires Intensify. Aired 12-1a E..."
2020.01.05,Did Not Air. 5-6p ET Did Not Air. 6-7p ET Iraqi Parliament to Hold Emergency Session; Iran Promises Revenge on U.S. for Assassination of General S...
2020.01.06,"U.S. Sends Draft Letter to Iraqi Government Suggesting U.S. Would Withdraw Troops, Then Says It Was A ""Mistake""; Sen. Bernie Sanders (I-VT) is Int..."
...,...
2020.12.26,"Possible Human Remains Found at Blast Site of Van that Exploded in Nashville, Tennessee, on Christmas Day; Authorities Exploring Possibility Van E..."
2020.12.27,Trump Doesn't Sign COVID Relief Bill; Investigators Say Nashville Blast Likely a Suicide Bombing; California Medical Professionals Working to Exha...
2020.12.28,"House Passes Bill For Bigger COVID Relief Check Supported By Trump, Opposed By GOP Senate; Interview With Rep. Adam Kinzinger (R- IL); GOP Congres..."
2020.12.29,"Colorado Health Officials Detect First Known Case In U.S. Of New COVID Strain; McConnell Attempting To Inject ""Poison Pill"" Provisions; Interview ..."
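
The exact-match filter above removes only rows whose whole text equals "Did Not Air 8-9p ET"; a row such as 2020.01.05, which mixes "Did Not Air." notices with real headlines, is kept unchanged. As a hedged alternative, the sketch below strips every such notice in-line and then drops rows that end up empty. The regular expression and the toy rows are assumptions and would need checking against the full data.

```python
import pandas as pd

# Hypothetical toy rows imitating the three situations seen in the data
toy = pd.DataFrame(
    {"transcript": [
        "Did Not Air 8-9p ET",
        "Iran Vows Revenge. Aired 12-1a ET",
        "Did Not Air. 5-6p ET Iraqi Parliament to Hold Emergency Session",
    ]},
    index=["2020.01.01", "2020.01.04", "2020.01.05"],
)

# Exact-match filter, as in the notebook: only the first row is dropped
exact = toy[toy.transcript != "Did Not Air 8-9p ET"].copy()

# Assumed pattern: "Did Not Air", optional period, then a time slot such as "8-9p ET"
pattern = r"Did Not Air\.?\s*\d{1,2}-\s*\d{1,2}[ap] ET"
stripped = toy.transcript.str.replace(pattern, "", regex=True).str.strip()

# Replace the column with the stripped text and drop rows left empty
cleaned = toy.assign(transcript=stripped)
cleaned = cleaned[cleaned.transcript != ""]
print(cleaned)
```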


In [47]:
data_df.transcript.loc['2020.01.02']

'Report: Unredacted E-Mails Tie Ukraine Aid Hold Directly to President Trump; Rep. Jackie Speier (D-CA) is Interviewed About New Documents Tying Ukraine Aid Hold Directly to President Trump; Biden Fires Back After Sanders Says "Just A Lot Of Baggage That Joe Takes Into Campaign; Iraqi TV: Head Of Iran Quds Force And Deputy Head Of Iraq Paramilitary Forces Killed In Attack At Baghdad Airport. Aired 8- 9p ET Report Shows Unredacted Emails Tie Ukraine Aid Hold Directly To President Trump; Iraqi T.V. Reports, Senior Iranian Military Official And Deputy Head Of Iraq Paramilitary Forces Killed In Rocket Strike Attack At Baghdad Airport. Aired 9-10p ET'

In [65]:
# Group by month
df_0 = data_df
df_0.index = pd.to_datetime(df_0.index, format='%Y.%m.%d')
df_0 = df_0.groupby([(df_0.index.month)]).agg(lambda x: ''.join(x))
df_0.index = df_0.index.map(str)
df_0

month,transcript
1,Report: Unredacted E-Mails Tie Ukraine Aid Hold Directly to President Trump; Rep. Jackie Speier (D-CA) is Interviewed About New Documents Tying Uk...
2,Candidates Make Final Push Before Iowa Caucuses; Final Sprint To Iowa Exposes Divide In Democratic Party; Iowa Voters Torn Between Candidates Ahea...
3,U.S. Confirms Its First Death from Coronavirus; Joe Biden Wins South Carolina Primary; Turkey Says It Won't Stop Refugees Headed to Europe; Chines...
4,"Trump: Protective Gear In National Stockpile Nearly Depleted; 211,600-Plus Coronavirus Cases In U.S., At Least 4,745 Deaths; Over 87 Percent Of Am..."
5,CDC Report: Limited Testing and Continued Travel Fueled Early Coronavirus Transmission in U.S.; FDA Approves Emergency Use Of Remdesivir For COVID...
6,Trump Vows To Deploy U.S. Military If States Can't Control Protests; Interview With Gov. Gretchen Whitmer (D-MI); Trump Vows To Deploy U.S. Milita...
7,Just Two States In The Entire Nation Show COVID-19 Infections Declining; Interview With Rep. Donna Shalala (D-FL); U.S. Sees Highest Single Day Of...
8,"In The U.S., 4.5 Million Confirmed COVID-19 Cases And Almost 152,000 American Lives Now Lost; Russia's Sputnik Moment With A Coronavirus Vaccine; ..."
9,"Donald Trump Visits Kenosha, Wisconsin; Politics of Fear; Convalescent Plasma Not Recommended To Treat COVID-19; New White House Adviser Pushing C..."
10,First Lady's Former Friend And Ex-East Wing Adviser Shares Audio Recordings Of Their Phone Calls; WH Adviser Hope Hicks Tests Positive For Covid-1...


In [66]:
df_0.transcript.loc['1']

'Report: Unredacted E-Mails Tie Ukraine Aid Hold Directly to President Trump; Rep. Jackie Speier (D-CA) is Interviewed About New Documents Tying Ukraine Aid Hold Directly to President Trump; Biden Fires Back After Sanders Says "Just A Lot Of Baggage That Joe Takes Into Campaign; Iraqi TV: Head Of Iran Quds Force And Deputy Head Of Iraq Paramilitary Forces Killed In Attack At Baghdad Airport. Aired 8- 9p ET Report Shows Unredacted Emails Tie Ukraine Aid Hold Directly To President Trump; Iraqi T.V. Reports, Senior Iranian Military Official And Deputy Head Of Iraq Paramilitary Forces Killed In Rocket Strike Attack At Baghdad Airport. Aired 9-10p ETPresident Trump: Iranian Commander Killed By U.S. Was Plotting "Imminent and Sinister" Attacks Against Americans; Reports: New Airstrike Targets Iran-Backed Militia Near Baghdad; Sen. Chris Van Hollen (D-MD) is Interviewed About Iranian Commander Killed By U.S.; Trump: Iranian Commander Killed by U.S. Was Plotting Attacks Against Americans; Wash
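
In the output above, the end of one day's text runs straight into the next ("ETPresident") because `''.join` concatenates the daily strings without a separator, which fuses two words into a single token. A minimal sketch on hypothetical toy data shows the same grouping with a space as separator. It also works on a copy, since `df_0 = data_df` makes both names refer to the same DataFrame, so changing `df_0.index` changes `data_df` as well.

```python
import pandas as pd

# Hypothetical toy rows: two days in January, one in February
toy = pd.DataFrame(
    {"transcript": [
        "Airport attack. Aired 9-10p ET",
        "President Trump: Iranian Commander Killed. Aired 8-9p ET",
        "Candidates Make Final Push Before Iowa Caucuses",
    ]},
    index=["2020.01.02", "2020.01.03", "2020.02.01"],
)

monthly = toy.copy()  # copy, so the daily frame keeps its original string index
monthly.index = pd.to_datetime(monthly.index, format="%Y.%m.%d")
# Join the daily texts of each month with a space between them
monthly = monthly.groupby(monthly.index.month).agg(lambda x: " ".join(x))
monthly.index = monthly.index.map(str)
print(monthly.transcript.loc["1"])
```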

In [70]:
# Apply a first round of text cleaning techniques
import re
import string
data_df = df_0
def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings avoid invalid-escape warnings for \[ , \w and \d
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1

In [71]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean.transcript.loc['1']

'report unredacted emails tie ukraine aid hold directly to president trump rep jackie speier dca is interviewed about new documents tying ukraine aid hold directly to president trump biden fires back after sanders says just a lot of baggage that joe takes into campaign iraqi tv head of iran quds force and deputy head of iraq paramilitary forces killed in attack at baghdad airport aired   et report shows unredacted emails tie ukraine aid hold directly to president trump iraqi tv reports senior iranian military official and deputy head of iraq paramilitary forces killed in rocket strike attack at baghdad airport aired  etpresident trump iranian commander killed by us was plotting imminent and sinister attacks against americans reports new airstrike targets iranbacked militia near baghdad sen chris van hollen dmd is interviewed about iranian commander killed by us trump iranian commander killed by us was plotting attacks against americans washington post some democrats privately worried a

In [72]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    text = re.sub('aired', '', text)  # drop the "aired" marker left over from "Aired 8-9p ET"
    text = re.sub(' [a-z](?= )', '', text)  # drop single letters that stand between two spaces
    return text

round2 = clean_text_round2

In [73]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

month,transcript
1,report unredacted emails tie ukraine aid hold directly to president trump rep jackie speier dca is interviewed about new documents tying ukraine a...
2,candidates make final push before iowa caucuses final sprint to iowa exposes divide in democratic party iowa voters torn between candidates ahead ...
3,us confirms its first death from coronavirus joe biden wins south carolina primary turkey says it wont stop refugees headed to europe chinese comp...
4,trump protective gear in national stockpile nearly depleted coronavirus cases in us at least deaths over percent of americans under stayathome ...
5,cdc report limited testing and continued travel fueled early coronavirus transmission in us fda approves emergency use of remdesivir for patients...
6,trump vows to deploy us military if states cant control protests interview with gov gretchen whitmer dmi trump vows to deploy us military if state...
7,just two states in the entire nation show infections declining interview with rep donna shalala dfl us sees highest single day of coronavirus cas...
8,in the us million confirmed cases and almost american lives now lost russias sputnik moment with coronavirus vaccine adm giroir answers coronav...
9,donald trump visits kenosha wisconsin politics of fear convalescent plasma not recommended to treat new white house adviser pushing controversial...
10,first ladys former friend and exeast wing adviser shares audio recordings of their phone calls wh adviser hope hicks tests positive for after tra...
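
The second cleaning round removes the substring "aired" wherever it occurs, including inside longer words, and it leaves runs of spaces behind. The variant below is a sketch of a stricter version, not the notebook's function: the name `clean_text_round2_strict`, the use of word boundaries and the final whitespace normalization are all assumptions.

```python
import re


def clean_text_round2_strict(text):
    """Hypothetical variant of round 2 that only removes whole words and tidies spaces."""
    text = re.sub(r'[‘’“”…]', '', text)      # curly quotes and ellipsis characters
    text = re.sub(r'\n', ' ', text)           # newlines become spaces
    text = re.sub(r'\baired\b', '', text)     # only the standalone word "aired"
    text = re.sub(r'\b[a-z]\b', '', text)     # standalone single letters
    return ' '.join(text.split())             # collapse repeated whitespace


sample = "a repaired bridge aired et"
result = clean_text_round2_strict(sample)
print(result)
```

With a plain substring match, "repaired" would be cut down to "rep"; the word boundaries keep it intact.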


In [15]:
index_copy = data_df.index.copy()
data_df['dates'] = index_copy
data_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


date,transcript,dates
2020.01.02,Report: Unredacted E-Mails Tie Ukraine Aid Hold Directly to President Trump; Rep. Jackie Speier (D-CA) is Interviewed About New Documents Tying Uk...,2020.01.02
2020.01.03,"President Trump: Iranian Commander Killed By U.S. Was Plotting ""Imminent and Sinister"" Attacks Against Americans; Reports: New Airstrike Targets I...",2020.01.03
2020.01.04,"Iran Vows ""Harsh Revenge"" for U.S. Killing of Top Commander; Australian Prime Minister Ramps Up Defense Response as Fires Intensify. Aired 12-1a E...",2020.01.04
2020.01.05,Did Not Air. 5-6p ET Did Not Air. 6-7p ET Iraqi Parliament to Hold Emergency Session; Iran Promises Revenge on U.S. for Assassination of General S...,2020.01.05
2020.01.06,"U.S. Sends Draft Letter to Iraqi Government Suggesting U.S. Would Withdraw Troops, Then Says It Was A ""Mistake""; Sen. Bernie Sanders (I-VT) is Int...",2020.01.06
...,...,...
2020.12.26,"Possible Human Remains Found at Blast Site of Van that Exploded in Nashville, Tennessee, on Christmas Day; Authorities Exploring Possibility Van E...",2020.12.26
2020.12.27,Trump Doesn't Sign COVID Relief Bill; Investigators Say Nashville Blast Likely a Suicide Bombing; California Medical Professionals Working to Exha...,2020.12.27
2020.12.28,"House Passes Bill For Bigger COVID Relief Check Supported By Trump, Opposed By GOP Senate; Interview With Rep. Adam Kinzinger (R- IL); GOP Congres...",2020.12.28
2020.12.29,"Colorado Health Officials Detect First Known Case In U.S. Of New COVID Strain; McConnell Attempting To Inject ""Poison Pill"" Provisions; Interview ...",2020.12.29
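
The `SettingWithCopyWarning` shown above appears because `data_df` was produced by boolean filtering, so pandas cannot tell whether the new column is meant for the original frame or for a copy. A common way to avoid it, sketched here on hypothetical toy data, is to take an explicit `.copy()` right after filtering.

```python
import pandas as pd

# Hypothetical toy frame with one "did not air" row
toy = pd.DataFrame(
    {"transcript": ["Did Not Air 8-9p ET", "Some headline. Aired 8-9p ET"]},
    index=["2020.01.01", "2020.01.02"],
)

# An explicit copy makes the filtered frame independent of the original
filtered = toy[toy.transcript != "Did Not Air 8-9p ET"].copy()

# Adding a column to the copy is unambiguous and leaves the original untouched
filtered["dates"] = filtered.index
print(filtered)
```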


In [74]:
# Let's pickle it for later use
data_df.to_pickle("corpus.pkl")

In [75]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
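# Fit the vocabulary and count terms: one row per document, one column per term
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())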
data_dtm.index = data_clean.index
data_dtm

month,abandon,abc,abcs,abe,aberys,abetting,able,aboard,abortion,abruptly,...,zakaria,zamboni,zealand,zelensky,zero,zeta,zhong,zone,zooms,zwaan
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,2,0,0
2,2,0,0,0,0,0,0,1,0,0,...,0,1,0,5,0,0,0,0,0,0
3,0,0,0,1,0,0,0,4,0,0,...,0,1,0,0,2,0,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
6,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
8,0,0,0,0,0,0,2,0,0,1,...,0,0,1,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,1,...,1,0,4,0,0,1,0,0,0,0
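
One way a document-term matrix like the one above might be read for the question about main topics is to list the most frequent terms per month. The sketch below does this on a small hypothetical matrix; the counts and the choice of two terms per month are made up for illustration and are not results from the CNN data.

```python
import pandas as pd

# Hypothetical document-term matrix: rows are months, columns are terms
toy_dtm = pd.DataFrame(
    {"coronavirus": [0, 12, 30], "iran": [9, 1, 1], "iowa": [2, 7, 3]},
    index=["1", "2", "3"],
)

# For each month, keep the two terms with the highest counts
top_terms = {
    month: row.nlargest(2).index.tolist()
    for month, row in toy_dtm.iterrows()
}
print(top_terms)
```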


In [76]:
# Let's pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [77]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:
    pickle.dump(cv, file)