## Introduction/Background
The goal of this project is to generate news headlines from the text of an article. This is a __text summarization__ problem, since a headline should be a one- to two-sentence summary of what the reader can expect from the article. The data set I use, nyt_news, was created with news-please, a news web scraper. The scraper is given one or more URLs, searches those sites for articles, and saves each article as an HTML file and a JSON file. For this project I scraped https://www.nytimes.com. Although news-please extracts many fields from each page, I use only two: _title_, the headline of the article, and _maintext_, its body text.
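As a rough illustration of the data layout described above, the sketch below collects the _title_ and _maintext_ fields from a directory of per-article JSON files. The helper name `load_articles`, the directory argument, and the assumption that each file holds one article with top-level `title` and `maintext` keys are mine for illustration; the actual export used in this project may be organized differently.

```python
import json
from pathlib import Path


def load_articles(json_dir):
    """Collect title/maintext pairs from per-article JSON files.

    Assumption (hypothetical layout): one article per file, with
    top-level 'title' and 'maintext' keys.
    """
    articles = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        title = record.get("title")
        maintext = record.get("maintext")
        # Skip articles missing either field; both are needed downstream.
        if title and maintext:
            articles.append({"title": title, "maintext": maintext})
    return articles
```

The resulting list of dictionaries could be passed to `pd.DataFrame(...)` to get a table with the same two columns used in the rest of this notebook.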


## Exploratory Data Analysis

In [27]:
# import all of the python modules/packages you'll need here
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from nltk.tokenize import word_tokenize

df = pd.read_csv('nytimes_news.csv') # Load the dataset from a CSV

print(df.head())

   Unnamed: 0                           authors        date_download  \
0           0                   ['John Leland']  2022-10-25 23:38:21   
1           0                                []  2022-10-25 23:47:05   
2           0                  ['Eric Schmitt']  2022-10-25 23:53:44   
3           0  ['Elena Bergeron', 'Ken Belson']  2022-10-25 23:28:05   
4           0               ['Nick Corasaniti']  2022-10-25 23:27:22   

           date_modify         date_publish  \
0  2022-10-25 23:38:21  2011-12-17 02:07:26   
1  2022-10-25 23:47:05  2019-08-16 13:53:18   
2  2022-10-25 23:53:44  2018-10-27 14:44:43   
3  2022-10-25 23:28:05  2020-10-03 16:06:18   
4  2022-10-25 23:27:22  2021-03-25 21:31:32   

                                         description  \
0  Dennis Crowley, a co-founder of the social net...   
1                                                NaN   
2  Defense Secretary Jim Mattis’s remarks reflect...   
3  Cam Newton was among those who tested positive...   
4  T

In [28]:
# Clean the data: remove rows where the title or maintext is missing.
df = df.dropna(subset=['title', 'maintext'])
# Drop the columns that are not used in this project.
df = df.drop(columns=['Unnamed: 0', 'date_download', 'date_modify', 'date_publish', 'authors',
                      'filename', 'image_url', 'localpath', 'title_rss', 'source_domain',
                      'url', 'site', 'title_page'])
df.head()

| | description | language | title | maintext |
|---|---|---|---|---|
| 0 | Dennis Crowley, a co-founder of the social net... | en | On Sundays, Foursquare Co-founder Goes Online,... | NIBBLE OF NEW YORK I don’t have a set schedule... |
| 2 | Defense Secretary Jim Mattis’s remarks reflect... | en | Mattis Vows U.S. Will Hold Khashoggi’s Killers... | Mr. Trump did not elaborate on the conversatio... |
| 3 | Cam Newton was among those who tested positive... | en | Patriots-Chiefs Game Postponed After Positive ... | The league has followed Major League Baseball ... |
| 4 | The law, which has been denounced by Democrats... | en | Georgia G.O.P. Passes Major Law to Limit Voting | In brief remarks on Thursday evening, Mr. Kemp... |
| 5 | I’m the art editor at The Times. Here are five... | en | What’s in Our Queue? Young Fathers and More | What’s in Our Queue? Young Fathers and More Ba... |


Now let's look at our data. The cell below shows summary statistics for maintext, the article text we care most about, since it is what each headline will summarize.

In [29]:
df['text_len'] = df['maintext'].str.len()
#data = df['text_len']
#plt.bar(df['text_len'],10,data=data)
#plt.show()
df.maintext.describe()
#print(f"Mean # of characters in text: {df['text_len'].mean()}")

count                                                 11157
unique                                                10819
top       Send any friend a story\nAs a subscriber, you ...
freq                                                     18
Name: maintext, dtype: object
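
The summary above reports fewer unique values than rows (10819 of 11157), and the most frequent value begins with what looks like subscriber boilerplate rather than article text. One possible way to handle this, which the notebook itself does not do, is sketched below. The helper name `drop_repeated_text` and the choice to match on the prefix `Send any friend a story` are assumptions for illustration; rows starting with that prefix may still contain real article text after it, so this is worth checking before dropping them.

```python
import pandas as pd


def drop_repeated_text(frame, boilerplate_prefix="Send any friend a story"):
    """Return a copy without boilerplate rows or duplicate maintext values."""
    # Flag rows whose text starts with the assumed boilerplate prefix.
    is_boilerplate = frame["maintext"].str.startswith(boilerplate_prefix, na=False)
    cleaned = frame[~is_boilerplate]
    # Keep only the first occurrence of each remaining article text.
    return cleaned.drop_duplicates(subset="maintext").reset_index(drop=True)
```

It could be applied right after the cleaning cell, for example `df = drop_repeated_text(df)`.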

Next, let's look at the distribution of article lengths, measured in characters.

In [30]:
df.text_len.describe()

count    11157.000000
mean      2033.284395
std       3236.360730
min         95.000000
25%       1068.000000
50%       1393.000000
75%       2021.000000
max      85182.000000
Name: text_len, dtype: float64
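
The mean (about 2033 characters) sits above the 75th percentile (2021), and the maximum is 85182, which suggests a long right tail. A minimal sketch of one way to see this without a plot is to count articles per length bucket. The bucket edges and the helper name `length_buckets` are arbitrary choices for illustration, not part of the original analysis.

```python
import pandas as pd


def length_buckets(lengths, edges=(0, 500, 1000, 2000, 5000, 10000, float("inf"))):
    """Count how many lengths fall into each (left, right] bucket.

    Note: a length exactly equal to the first edge (0 by default)
    falls outside every bucket and is not counted.
    """
    labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]
    binned = pd.cut(pd.Series(lengths), bins=list(edges), labels=labels)
    counts = binned.value_counts()
    # Report buckets in ascending order, including empty ones.
    return {label: int(counts.get(label, 0)) for label in labels}
```

Calling `length_buckets(df['text_len'])` would give a quick view of how many articles are unusually short or long, which may matter later if the model has a maximum input length.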

Now to tokenize the text data.

In [31]:
df['raw_tokens'] = df['maintext'].apply(lambda x: word_tokenize(x.lower()))
# df['raw_tokens'] = df['maintext'].apply(lambda x: x.split()) # used whitespace tokenization for now because nltk is being rude.
df.head()

| | description | language | title | maintext | text_len | raw_tokens |
|---|---|---|---|---|---|---|
| 0 | Dennis Crowley, a co-founder of the social net... | en | On Sundays, Foursquare Co-founder Goes Online,... | NIBBLE OF NEW YORK I don’t have a set schedule... | 1212 | [nibble, of, new, york, i, don, ’, t, have, a,... |
| 2 | Defense Secretary Jim Mattis’s remarks reflect... | en | Mattis Vows U.S. Will Hold Khashoggi’s Killers... | Mr. Trump did not elaborate on the conversatio... | 1781 | [mr., trump, did, not, elaborate, on, the, con... |
| 3 | Cam Newton was among those who tested positive... | en | Patriots-Chiefs Game Postponed After Positive ... | The league has followed Major League Baseball ... | 1407 | [the, league, has, followed, major, league, ba... |
| 4 | The law, which has been denounced by Democrats... | en | Georgia G.O.P. Passes Major Law to Limit Voting | In brief remarks on Thursday evening, Mr. Kemp... | 2344 | [in, brief, remarks, on, thursday, evening, ,,... |
| 5 | I’m the art editor at The Times. Here are five... | en | What’s in Our Queue? Young Fathers and More | What’s in Our Queue? Young Fathers and More Ba... | 356 | [what, ’, s, in, our, queue, ?, young, fathers... |
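
In the output above, `don’t` comes out as the three tokens `don`, `’`, `t`, which suggests the typographic apostrophe is being treated as a separate symbol. One possible remedy, sketched below, is to map curly quotes to their ASCII equivalents before tokenizing. The helper name `normalize_quotes` is hypothetical, and whether this improves the generated headlines is untested here.

```python
# Map typographic quotes to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(_QUOTE_MAP)
```

It could be applied just before tokenization, for example `word_tokenize(normalize_quotes(x).lower())` inside the existing `lambda`.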


In [32]:
import re  # needed for the re.match filter below
from nltk.corpus import stopwords
# Code by sgeinitz, https://github.com/sgeinitz/cs39aa_notebooks/blob/main/nb_C_airline_tweets_take2.ipynb
# Thank you!
stops = set(stopwords.words('english'))  # English stopwords to remove
chars2remove = {'.', '!', '/', '?'}  # punctuation tokens to remove
# Remove stopwords, then the punctuation tokens listed above.
df['raw_tokens'] = df['raw_tokens'].apply(lambda x: [w for w in x if w not in stops])
df['raw_tokens'] = df['raw_tokens'].apply(lambda x: [w for w in x if w not in chars2remove])
# Remove tokens that start with '#'.
df['raw_tokens'] = df['raw_tokens'].apply(lambda x: [w for w in x if not re.match('^#', w)])
df.head()

| | description | language | title | maintext | text_len | raw_tokens |
|---|---|---|---|---|---|---|
| 0 | Dennis Crowley, a co-founder of the social net... | en | On Sundays, Foursquare Co-founder Goes Online,... | NIBBLE OF NEW YORK I don’t have a set schedule... | 1212 | [nibble, new, york, ’, set, schedule, go, long... |
| 2 | Defense Secretary Jim Mattis’s remarks reflect... | en | Mattis Vows U.S. Will Hold Khashoggi’s Killers... | Mr. Trump did not elaborate on the conversatio... | 1781 | [mr., trump, elaborate, conversation, appeared... |
| 3 | Cam Newton was among those who tested positive... | en | Patriots-Chiefs Game Postponed After Positive ... | The league has followed Major League Baseball ... | 1407 | [league, followed, major, league, baseball, re... |
| 4 | The law, which has been denounced by Democrats... | en | Georgia G.O.P. Passes Major Law to Limit Voting | In brief remarks on Thursday evening, Mr. Kemp... | 2344 | [brief, remarks, thursday, evening, ,, mr., ke... |
| 5 | I’m the art editor at The Times. Here are five... | en | What’s in Our Queue? Young Fathers and More | What’s in Our Queue? Young Fathers and More Ba... | 356 | [’, queue, young, fathers, barbara, graustark,... |
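
The output above still contains tokens such as `’` and `,`, because `chars2remove` lists only four characters. A more general filter, sketched below as one option, keeps a token only if it contains at least one letter or digit. The helper name `drop_punct_tokens` is hypothetical.

```python
def drop_punct_tokens(tokens):
    """Keep tokens that contain at least one alphanumeric character."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
```

It could be applied with `df['raw_tokens'] = df['raw_tokens'].apply(drop_punct_tokens)`. Tokens such as `mr.` are kept because they contain letters, so abbreviations survive while stand-alone punctuation is removed.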
