# Exploratory Data Analysis of the New York Times News Dataset

The New York Times is one of the best-known online news platforms in the world, with a reputation for engaging its readers. This dataset collects metadata for a large archive of its articles, which makes it possible to explore how coverage is distributed across dates, sections, and article types.

### Columns:
- **abstract:** A brief summary or description of the article's content.
- **web_url:** The web address or URL of the article.
- **snippet:** A short excerpt or snippet from the article.
- **lead_paragraph:** The introductory paragraph of the article.
- **print_section:** The section in the print version of the newspaper where the article was published.
- **print_page:** The page number in the print version of the newspaper where the article was published.
- **source:** The source or provider of the article.
- **multimedia:** Information about any multimedia content associated with the article, such as images or videos.
- **headline:** The title or heading of the article.
- **keywords:** Tags or keywords associated with the article, providing insights into its content.

In [3]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

In [4]:
nyt_df = pd.read_csv('../Data/Original/nyt-metadata.csv')
nyt_df.head()


Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name
0,Article on upcoming New York Giants-Dallas Cow...,https://www.nytimes.com/2000/01/01/sports/pro-...,Article on upcoming New York Giants-Dallas Cow...,Waiting in the visiting locker room at Texas S...,D,2.0,The New York Times,[],"{'main': 'Playoffs or No, Dallas Provides The ...","[{'name': 'organizations', 'value': 'New York ...",2000-01-01 05:00:00+00:00,article,Sports Desk,Sports,"{'original': 'By Bill Pennington', 'person': [...",News,nyt://article/01111a48-3502-5021-8096-bc929379...,819.0,nyt://article/01111a48-3502-5021-8096-bc929379...,
1,Jeanne C Pond letter expresses hope that spiri...,https://www.nytimes.com/2000/01/01/opinion/l-o...,Jeanne C Pond letter expresses hope that spiri...,To the Editor:,A,30.0,The New York Times,[],"{'main': 'On This First Day, a Fanfare for the...","[{'name': 'persons', 'value': 'Pond, Jeanne C'...",2000-01-01 05:00:00+00:00,article,Editorial Desk,Opinion,"{'original': '', 'person': [], 'organization':...",Letter,nyt://article/02328edc-dad1-5eb0-900e-917162e4...,122.0,nyt://article/02328edc-dad1-5eb0-900e-917162e4...,
2,Many experts on Y2K computer problem report th...,https://www.nytimes.com/2000/01/01/us/1-1-00-t...,Many experts on Y2K computer problem report th...,As the world slid nervously yesterday through ...,A,10.0,The New York Times,[],"{'main': ""Internet's Cheering Squad Nervously ...","[{'name': 'subject', 'value': 'Electronic Mail...",2000-01-01 05:00:00+00:00,article,National Desk,U.S.,"{'original': 'By Barnaby J. Feder', 'person': ...",News,nyt://article/02a8f89b-153f-5b84-983c-e328de5b...,761.0,nyt://article/02a8f89b-153f-5b84-983c-e328de5b...,
3,WILL the forces of globalism continue to push ...,https://www.nytimes.com/2000/01/01/news/vision...,,WILL the forces of globalism continue to push ...,E,4.0,The New York Times,[],{'main': 'Economic Thinking Finds a Free Marke...,[],2000-01-01 05:00:00+00:00,article,The Millennium,Archives,"{'original': 'By Floyd Norris', 'person': [{'f...",News,nyt://article/0634d837-97b8-59b5-aa17-f90d1a89...,916.0,nyt://article/0634d837-97b8-59b5-aa17-f90d1a89...,
4,SPECIAL TODAY The Millennium Envisioning th...,https://www.nytimes.com/2000/01/01/nyregion/in...,,SPECIAL TODAY,A,1.0,The New York Times,[],"{'main': 'INSIDE', 'kicker': None, 'content_ki...",[],2000-01-01 05:00:00+00:00,article,Metropolitan Desk,New York,"{'original': '', 'person': [], 'organization':...",Summary,nyt://article/0654cc64-c37f-594d-9290-1ce578cd...,102.0,nyt://article/0654cc64-c37f-594d-9290-1ce578cd...,
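
The preview shows `print_page` values such as `2.0` and `30.0`, yet `info()` reports the column as `object`. One possible explanation, which is an assumption and not something the outputs confirm, is that some entries are not purely numeric. A minimal sketch of how that could be checked on made-up values (the toy series below is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Made-up values that mimic a column mixing numbers and text (hypothetical data).
toy_pages = pd.Series([2.0, "30", "A1", None], name="print_page")

# Coerce to numbers; anything that cannot be parsed becomes NaN.
numeric_page = pd.to_numeric(toy_pages, errors="coerce")

# Entries that were present in the original but failed to parse.
unparsed = toy_pages[numeric_page.isna() & toy_pages.notna()]

print(numeric_page)
print(unparsed)
```

Applied to `nyt_df['print_page']`, the `unparsed` selection would list the raw values that prevent a numeric dtype, if there are any.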


### Structure of Dataset

In [6]:
nyt_df.shape

(2191867, 20)

In [9]:
nyt_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2191867 entries, 0 to 2191866
Data columns (total 20 columns):
 #   Column            Dtype  
---  ------            -----  
 0   abstract          object 
 1   web_url           object 
 2   snippet           object 
 3   lead_paragraph    object 
 4   print_section     object 
 5   print_page        object 
 6   source            object 
 7   multimedia        object 
 8   headline          object 
 9   keywords          object 
 10  pub_date          object 
 11  document_type     object 
 12  news_desk         object 
 13  section_name      object 
 14  byline            object 
 15  type_of_material  object 
 16  _id               object 
 17  word_count        float64
 18  uri               object 
 19  subsection_name   object 
dtypes: float64(1), object(19)
memory usage: 334.5+ MB
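
The `+` in `334.5+ MB` means pandas did not measure the contents of the `object` columns, so the real footprint is larger. The sketch below shows one common way to measure it with `memory_usage(deep=True)` and to shrink low-cardinality text columns by converting them to `category`. It runs on a small made-up frame; which columns of `nyt_df` are actually low-cardinality is an assumption that would need checking with `nunique()`.

```python
import pandas as pd

# Toy stand-in for nyt_df with made-up, highly repetitive text values.
toy_df = pd.DataFrame({
    "source": ["The New York Times"] * 1000,
    "document_type": ["article"] * 900 + ["multimedia"] * 100,
    "word_count": [500.0] * 1000,
})

# deep=True also counts the Python strings stored in object columns.
before = toy_df.memory_usage(deep=True).sum()

# Repetitive text columns are stored far more compactly as categoricals.
for col in ["source", "document_type"]:
    toy_df[col] = toy_df[col].astype("category")

after = toy_df.memory_usage(deep=True).sum()
print(f"before: {before} bytes, after: {after} bytes")
```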


In [25]:
#Parse pub_date and keep only the calendar date for the analysis
nyt_df['pub_date'] = pd.to_datetime(nyt_df['pub_date'], errors='coerce')
nyt_df['pub_date'] = nyt_df['pub_date'].dt.date
nyt_df['pub_date'].head(2)

0    2000-01-01
1    2000-01-01
Name: pub_date, dtype: object
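
Note that `.dt.date` produces plain Python `date` objects, which is why the dtype above is `object`. An alternative, sketched here on made-up strings in the same format as `pub_date`, is to drop the time of day with `.dt.normalize()` so the column stays a `datetime64` type and the `.dt` accessors remain available. The column name `pub_day` is hypothetical.

```python
import pandas as pd

# Made-up timestamps in the same format as pub_date, plus one invalid entry.
toy = pd.DataFrame({
    "pub_date": [
        "2000-01-01 05:00:00+00:00",
        "2000-01-02 05:00:00+00:00",
        "not a date",
    ]
})

# Invalid strings become NaT instead of raising an error.
toy["pub_date"] = pd.to_datetime(toy["pub_date"], errors="coerce", utc=True)

# Set the time to midnight but keep the datetime64 dtype.
toy["pub_day"] = toy["pub_date"].dt.normalize()

print(toy["pub_day"])
print(toy["pub_day"].dt.year)
```

Whether to keep `date` objects or `datetime64` values is a design choice: the former prints more cleanly, the latter supports resampling and vectorized date arithmetic.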

### Summarize the Data

In [26]:
#Count missing values per column
missing_per_column = nyt_df.isnull().sum()
print(missing_per_column)

abstract              31917
web_url                   1
snippet              196167
lead_paragraph        41989
print_section        746774
print_page           748022
source                    1
multimedia                1
headline                  1
keywords                  1
pub_date                  2
document_type             1
news_desk            301299
section_name           2010
byline                    1
type_of_material      85290
_id                       1
word_count                2
uri                       2
subsection_name     1603587
dtype: int64
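
Raw counts are easier to compare as shares of the total. For instance, `subsection_name` is missing in 1,603,587 of 2,191,867 rows, roughly 73%. A minimal sketch of computing the percentage for every column, shown on a small made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and a few gaps.
toy = pd.DataFrame({
    "abstract": ["a", None, "c", "d"],
    "subsection_name": [None, None, None, "Politics"],
    "word_count": [819.0, 122.0, np.nan, 102.0],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `nyt_df` would rank all 20 columns by how incomplete they are.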


In [29]:
#Oldest and most recent publication dates
nyt_df = nyt_df.dropna(subset=['pub_date'])
print("Oldest Date: ", nyt_df['pub_date'].min())
print("Recent Date: ", nyt_df['pub_date'].max())

Oldest Date:  2000-01-01
Recent Date:  2025-05-01


In [30]:
#Number of articles per day for the first 20 days
counts = nyt_df['pub_date'].value_counts().sort_index()
print(counts.head(20))

pub_date
2000-01-01    244
2000-01-02    477
2000-01-03    184
2000-01-04    258
2000-01-05    239
2000-01-06    253
2000-01-07    252
2000-01-08    195
2000-01-09    550
2000-01-10    175
2000-01-11    272
2000-01-12    268
2000-01-13    303
2000-01-14    211
2000-01-15    178
2000-01-16    536
2000-01-17    161
2000-01-18    221
2000-01-19    229
2000-01-20    276
Name: count, dtype: int64
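
The counts above peak on 2000-01-02, 2000-01-09, and 2000-01-16, which were all Sundays. The sketch below reuses exactly those 20 daily counts to average them by weekday. It builds the series by hand so that it runs on its own; with the real data, the index of `counts` would first need converting via `pd.to_datetime`, because it holds plain `date` objects.

```python
import pandas as pd

# Daily counts copied from the output above (2000-01-01 through 2000-01-20).
daily = pd.Series(
    [244, 477, 184, 258, 239, 253, 252, 195, 550, 175,
     272, 268, 303, 211, 178, 536, 161, 221, 229, 276],
    index=pd.date_range("2000-01-01", periods=20, freq="D"),
)

# Average number of articles for each day of the week.
by_weekday = daily.groupby(daily.index.day_name()).mean()
print(by_weekday.sort_values(ascending=False))
```

Twenty days is a very small sample, so this only hints at a weekly pattern; the full series would be needed to confirm it.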


In [31]:
#Number of articles per day for the last 20 days
print(counts.tail(20))

pub_date
2024-12-17    154
2024-12-18    148
2024-12-19    167
2024-12-20    195
2024-12-21     84
2024-12-22     57
2024-12-23    115
2024-12-24     96
2024-12-25     70
2024-12-26     95
2024-12-27    102
2024-12-28     60
2024-12-29     74
2024-12-30    108
2024-12-31     68
2025-01-01      2
2025-02-01     24
2025-03-01      7
2025-04-01    164
2025-05-01    186
Name: count, dtype: int64


In [32]:
#All unique values of type_of_material
nyt_df['type_of_material'].unique()

array(['News', 'Letter', 'Summary', 'Paid Death Notice', 'Chronology',
       'Review', 'Op-Ed', 'Obituary; Biography', 'Correction', 'List',
       'Interview', 'Biography', 'An Analysis; News Analysis',
       'Editorial', 'Statistics', "Editors' Note", 'Schedule', 'Series',
       'Question', 'Special Report', 'Text', 'Op-Ed; Caption',
       'Transcript', 'Biography; Obituary', 'Interview; Text',
       'Chronology; Special Report', 'Series; Interview', 'An Appraisal',
       'Series; Text', 'Caption; Op-Ed', 'Special Report; Chronology',
       'Text; Interview', 'An Analysis', nan, 'Series; Chronology',
       'An Analysis; Military Analysis', 'Chronology; Series',
       'Series; Biography', 'News Analysis', 'QandA', 'Results Listing',
       'Profile', 'Slideshow', 'Obituary',
       'An Analysis; Economic Analysis', 'Sidebar',
       'Chronology; An Analysis; News Analysis', 'Interview; Series',
       'Video', 'Quote', 'Biography; Series', 'Editorial; Series',
       'Series;

In [41]:
counts_type_material = nyt_df['type_of_material'].value_counts().sort_index()
print(counts_type_material.head(30))

type_of_material
Addendum                                      3
An Analysis                                  87
An Analysis; Economic Analysis               19
An Analysis; Military Analysis               68
An Analysis; News Analysis                 1949
An Analysis; News Analysis; Chronology        1
An Appraisal                                235
Audio Podcast                                 9
Biography                                  2786
Biography; Chronology                         1
Biography; Obituary                          17
Biography; Series                             1
Brief                                     21527
Caption                                      85
Caption; Editorial                            1
Caption; Op-Ed                               21
Chronology                                  246
Chronology; An Analysis; News Analysis        2
Chronology; Series                            1
Chronology; Special Report                    2
Correction             
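
Many values combine several labels separated by `;`, and the same pair can appear in either order (for example `Biography; Obituary` and `Obituary; Biography`). One way to count the individual labels, sketched on a handful of values taken from the outputs above, is to split each entry and explode the result into one row per label:

```python
import pandas as pd

# A few type_of_material values as they appear in the outputs above.
toy = pd.Series(
    [
        "News",
        "Biography; Obituary",
        "Obituary; Biography",
        "An Analysis; News Analysis",
        None,
    ],
    name="type_of_material",
)

# Split on ';', put each label on its own row, and trim surrounding spaces.
labels = toy.dropna().str.split(";").explode().str.strip()

label_counts = labels.value_counts()
print(label_counts)
```

Counting labels this way means an article with two labels contributes to both totals, so the sums will exceed the number of articles.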

In [38]:
most_common = counts_type_material.idxmax()
max_count = counts_type_material.max()

print(f"Most common type_of_material: {most_common} ({max_count} times)")

Most common type_of_material: News (1417125 times)


In [40]:
#idxmin returns only the first label with the minimum count; several types occur just once
least_common = counts_type_material.idxmin()
min_count = counts_type_material.min()

print(f"Least common type_of_material: {least_common} ({min_count} times)")

Least common type_of_material: An Analysis; News Analysis; Chronology (1 times)
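
Because `idxmin()` reports a single label even when several share the minimum, it can be more informative to list every tied value. A minimal sketch on a made-up counts series (the numbers are illustrative, not the real totals):

```python
import pandas as pd

# Made-up counts; two labels tie for the minimum.
toy_counts = pd.Series({
    "News": 10,
    "Biography; Series": 1,
    "Chronology; Series": 1,
    "Brief": 4,
})

lowest = toy_counts.min()

# Keep every label whose count equals the minimum.
rarest = toy_counts[toy_counts == lowest].index.tolist()
print(f"Rarest types ({lowest} occurrence each): {rarest}")
```

The same filter works on `counts_type_material` directly.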
