<a href="https://colab.research.google.com/github/xavierw39/Twitter-Text-Analysis/blob/main/API_Data_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction
### Background
On April 14, 2022, Elon Musk, the CEO of Tesla and SpaceX, announced his intention to acquire Twitter. The acquisition was completed about six months later. During those months, a series of incidents reflected the confrontations and negotiations between the social platform and the business magnate.

### Project Introduction
This project uses the Newsdata.io API to extract about 5,000 Twitter-related news articles published between May and November 2022. It includes:
* a short description of the API data extraction process
* text processing
* EDA on the news text, and
* several NLP methods used to identify trending news topics and buzzwords about Twitter

## API News Extraction
This project collects its news articles mainly through [Newsdata.io](https://newsdata.io/). Many other news APIs are also worth trying, such as News API, the Bloomberg API, and the Guardian API.

Since I have tried both News API and Newsdata.io, sample extraction code for both is shown below.

### News API
News API provides two ways to extract news. The first extracts top headlines and lets you specify parameters such as category, language, and sources, but not other conditions such as time. The second extracts all news and accepts more parameters, including a date range.

See the [News API documentation](https://newsapi.org/docs/endpoints) for details.

In [None]:
# pip install newsapi-python
from newsapi import NewsApiClient

In [None]:
# Init
newsapi = NewsApiClient(api_key='your_newsapi_key')

# /v2/top-headlines
top_headlines = newsapi.get_top_headlines(q='twitter',
                                          #sources='bbc-news,the-verge',
                                          category='business',
                                          language='en',
                                          country='us')


In [None]:
# /v2/everything
sample_article = newsapi.get_everything(q='twitter',
                                      #sources='bbc-news,the-verge',
                                      #domains='bbc.co.uk,techcrunch.com',
                                      from_param='2022-10-04',
                                      to='2022-11-03',
                                      language='en',
                                      sort_by='relevancy')


In [None]:
from pprint import pprint

In [None]:
len(sample_article['articles'])

100

In [None]:
import pandas as pd
sample_df = pd.DataFrame(sample_article['articles']) # save results into df

In [None]:
sample_df = sample_df[sample_df['title'].str.contains('Twitter', na=False)] # keep only articles whose title contains "Twitter"

In [None]:
sample_df.columns # shows the columns of the extracted data

Index(['source', 'author', 'title', 'description', 'url', 'urlToImage',
       'publishedAt', 'content'],
      dtype='object')
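
In News API responses, the `source` field of each article is a nested object with `id` and `name` keys, so the `source` column above holds dictionaries rather than plain strings. Below is a minimal sketch of flattening it with `pandas.json_normalize`. The two articles are made-up stand-ins for a real API response, and the names `sample_articles` and `flat_df` are hypothetical.

```python
import pandas as pd

# Hypothetical articles shaped like the 'articles' list of a News API response.
sample_articles = [
    {"source": {"id": "the-verge", "name": "The Verge"},
     "title": "Twitter example headline one",
     "publishedAt": "2022-10-05T12:00:00Z"},
    {"source": {"id": None, "name": "Example News"},
     "title": "Twitter example headline two",
     "publishedAt": "2022-10-06T08:30:00Z"},
]

# json_normalize expands nested dicts into dotted column names
# ('source.id', 'source.name'); rename them for easier access.
flat_df = pd.json_normalize(sample_articles)
flat_df = flat_df.rename(columns={"source.id": "source_id",
                                  "source.name": "source_name"})

print(flat_df[["source_name", "title"]])
```

With the real data, the same call could be applied to `sample_article['articles']` instead of the made-up list.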

### Newsdata.io

In [1]:
# !pip install newsdataapi

In [None]:
from newsdataapi import NewsDataApiClient

# Initialize the client with your API key.
api = NewsDataApiClient(apikey="your_newsdataio_key")

# Request parameters are optional, e.g. country="us".
response = api.news_api(q='twitter', country='us', language='en', category='business', page=1)

In [None]:
len(response['results'])

0

### Extract Twitter news from 05-24 to 11-05 in the business category.

In [None]:
import requests

# Query the archive endpoint; replace your_newsdataio_key with your own key.
url = ("https://newsdata.io/api/1/archive?apikey=your_newsdataio_key"
       "&q=twitter&country=us&language=en&category=business"
       "&from_date=2022-04-14&to_date=2022-11-05")
response = requests.get(url)

In [None]:
from pprint import pprint
import json

In [None]:
response.text



In [None]:
response.json()['results'][1]

{'title': 'Twitter-Elon Musk Timeline: Pay-for-Verification Appears in App, Dorsey Speaks - CNET',
 'link': 'https://news.google.com/__i/rss/rd/articles/CBMidGh0dHBzOi8vd3d3LmNuZXQuY29tL25ld3Mvc29jaWFsLW1lZGlhL3R3aXR0ZXItZWxvbi1tdXNrLXRpbWVsaW5lLXBheS1mb3ItdmVyaWZpY2F0aW9uLWFwcGVhcnMtaW4tYXBwLWRvcnNleS1zcGVha3Mv0gEA?oc=5',
 'keywords': None,
 'creator': None,
 'video_url': None,
 'description': "Twitter-Elon Musk Timeline: Pay-for-Verification Appears in App, Dorsey Speaks\xa0\xa0CNETElon Musk blasts AOC: 'Not everything AOC says is 100% accurate'\xa0\xa0Fox BusinessElon Musk's Response to AOC's Idiotic Attack Is Perfect | ROUNDTABLE | Rubin Report\xa0\xa0The Rubin ReportDoes anyone really think Elon Musk cares about supporting creatives on Twitter?\xa0\xa0The GuardianWhich Scottish politicians will wear Elon Musk's blue tick of shame?\xa0\xa0HeraldScotlandView Full Coverage on Google News",
 'content': None,
 'pubDate': '2022-11-05 23:22:00',
 'image_url': None,
 'source_id': 'google'

**Page parameter**

Since each page is limited to 100 articles, specifying the `page` parameter allows us to extract articles from multiple pages.

In [None]:
data = []
page = 1
base_url = ("https://newsdata.io/api/1/archive?apikey=your_newsdataio_key"
            "&q=twitter&country=us&language=en&category=business"
            "&from_date=2022-04-14&to_date=2022-11-05&page=")
while True:
    # Request one page of results at a time.
    response_API = requests.get(base_url + str(page)).json()
    new_results = response_API.get('results', [])
    # Stop when a page is empty or 'results' is not a list of articles.
    if not isinstance(new_results, list) or not new_results:
        break
    data.extend(new_results)
    page += 1
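
The paging logic can be tried offline before spending API quota. The sketch below separates the loop from the HTTP call and adds a page cap as a safety net. `fetch_all_pages`, `fake_fetch_page`, and `max_pages` are hypothetical names introduced for illustration, and the fake payloads only mimic the `results` key used above.

```python
from typing import Callable, Dict, List


def fetch_all_pages(fetch_page: Callable[[int], Dict],
                    max_pages: int = 100) -> List[Dict]:
    """Collect 'results' from consecutive pages until one comes back empty."""
    articles: List[Dict] = []
    for page in range(1, max_pages + 1):
        payload = fetch_page(page)
        results = payload.get("results", [])
        # Stop on an empty page, or when 'results' is not a list of articles.
        if not isinstance(results, list) or not results:
            break
        articles.extend(results)
    return articles


# Stand-in for the real HTTP call: two pages of fake articles, then nothing.
def fake_fetch_page(page: int) -> Dict:
    fake_pages = {1: [{"title": "a"}, {"title": "b"}], 2: [{"title": "c"}]}
    return {"status": "success", "results": fake_pages.get(page, [])}


all_articles = fetch_all_pages(fake_fetch_page)
print(len(all_articles))  # 3
```

Against the real API, `fetch_page` could be a small function that appends the page number to the request URL and returns `requests.get(...).json()`.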

In [None]:
# Save the extracted articles to a JSON file.
with open('data0524_1105.json', 'w') as f:
    json.dump(data, f)

**Disclaimer**: Depending on the API plan you choose, some parameters (e.g. `from_date`, `to_date`) may not be available. This is NOT an advertisement for any of the APIs mentioned in this notebook. Since these APIs are not cheap, I strongly recommend choosing carefully based on your budget and needs.

Read the JSON file into a DataFrame

In [None]:
import json
# Load the saved list of articles back from the JSON file.
with open('data0524_1105.json') as f:
    news_lst = json.load(f)

In [None]:
len(news_lst)

1032

In [None]:
import pandas as pd
import numpy as np
news_df = pd.DataFrame(news_lst)

In [None]:
news_df.head()

| | title | link | keywords | creator | video_url | description | content | pubDate | image_url | source_id | country | category | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Elon Musk Says Twitter’s Had A Massive Revenue... | https://deadline.com/2022/11/elon-musk-twitter... | [Advertising, Breaking News, Social Media, act... | [jillg366] | | As Twitter employees face mass layoffs startin... | | 2022-11-04 16:11:22 | https://deadline.com/wp-content/uploads/2022/1... | deadline | [united states of america] | [business] | english |
| 1 | Twitter layoffs begin as Elon Musk admits ‘mas... | https://www.theguardian.com/technology/2022/no... | [Twitter, Elon Musk, Technology] | [Dominic Rushe and Gloria Oladipo in New York ... | | Billionaire blames financial woes on activist ... | | 2022-11-04 16:00:12 | https://i.guim.co.uk/img/media/7cba98d05719d33... | theguardian | [united states of america] | [business] | english |
| 2 | Inside Twitter's chaotic short-notice layoffs | https://www.nbcnews.com/tech/tech-news/twitter... | | [Daniel Arkin and Lora Kolodny, CNBC] | | Twitter was plunged into turmoil Friday after ... | Twitter was plunged into turmoil Friday after ... | 2022-11-04 15:49:18 | https://media-cldnry.s-nbcnews.com/image/uploa... | nbcnews | [united states of america] | [business] | english |
| 3 | Twitter spent years building its staff. Under ... | https://www.latimes.com/business/story/2022-11... | | [Samantha Masunaga] | | Elon Musk reportedly plans to lay off as much ... | Hiring is challenging for every industry, but ... | 2022-11-04 15:48:44 | https://ca-times.brightspotcdn.com/dims4/defau... | latimes | [united states of america] | [business] | english |
| 4 | Elon Musk's Twitter begins laying off employee... | https://news.google.com/__i/rss/rd/articles/CB... | | | | Elon Musk's Twitter begins laying off employee... | | 2022-11-04 15:32:00 | | google | [united states of america] | [business] | english |
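
The preview shows that `pubDate` is stored as a string and that columns such as `creator`, `country`, and `category` hold lists. The sketch below shows one possible way to tidy these before analysis, using a hypothetical two-row frame with matching column names. The cleaning steps and the names `demo_df` and `join_list` are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical rows mimicking the structure of news_df.
demo_df = pd.DataFrame({
    "title": ["Example headline A", "Example headline B"],
    "pubDate": ["2022-11-04 16:11:22", "2022-11-04 15:32:00"],
    "category": [["business"], ["business", "technology"]],
    "creator": [["jane doe"], None],
})

# Parse the timestamp strings into datetime values.
demo_df["pubDate"] = pd.to_datetime(demo_df["pubDate"])


# Join list-valued cells into one string; missing values stay missing.
def join_list(value):
    return ", ".join(value) if isinstance(value, list) else None


for col in ["category", "creator"]:
    demo_df[col] = demo_df[col].apply(join_list)

print(demo_df.dtypes["pubDate"])
print(demo_df["category"].tolist())
```

Once `pubDate` is a datetime column, articles can be sorted or grouped by day with standard pandas operations.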


In this sample code, we extracted around 1000 news articles using newsdata.io. In the next section we will read a larger dataset to kick off our text analysis.