<a href="https://colab.research.google.com/github/shicong621/Colab/blob/main/CS505_PA1_Student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we practice scraping data from several sources: Twitter, Wikipedia, and news websites.

**You will use the output of the Twitter part to finish Assignment 1. The rest will be used later.**

First, we are going to learn how to scrape tweets from Twitter.

To do so, you need to sign up for a developer account [here](https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api).

After doing so, please make sure you have access to your Bearer token as well as your other keys/tokens. Otherwise you won't be able to scrape data from Twitter.
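
One way to keep the Bearer token out of the notebook itself is to read it from an environment variable. The sketch below is only an illustration: the variable name `TWITTER_BEARER_TOKEN` and the helper `load_bearer_token` are assumptions made for this example, not part of tweepy or the Twitter API.

```python
import os

def load_bearer_token(env_var="TWITTER_BEARER_TOKEN"):
    """Return the token stored in env_var (a hypothetical variable name)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a readable message instead of sending an empty token.
        raise RuntimeError("Set the {} environment variable first.".format(env_var))
    return token
```

The returned string can then be passed as `bearer_token` when the client is created in Step 2.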

You may also want to mount your Google Drive so that you can save the scraped data to your drive for later.

In [80]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We strongly recommend that you have a look at this [article](https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9) to get familiar with what you can do with the scraping tool.

If you still have problems, don't hesitate to post your question(s) on Piazza or come to the TA's office hours.

In [None]:
#Step 1: install tweepy, the tool we use to scrape data from Twitter (before that, please make sure you have your Bearer token ready)
!pip3 install tweepy
!pip3 install tweepy --upgrade # make sure your tweepy is up-to-date (>=4.10.1), otherwise there's a chance you won't be able to interact with Twitter API v2.
#Restarting the runtime might be needed.

In [None]:
#Step 2: initiate your client
# You will need to get your bearer token from the email sent to you.
import tweepy
client = tweepy.Client(bearer_token='YOUR BEARER TOKEN HERE') # replace with your bearer token here.


In [None]:
#Step 3: a first try at scraping data. Run the following code to see if we can collect some tweets related to 'football'
query = 'football lang:en -is:retweet' # the query restricts the collected tweets to contain 'football', to be in English, and to not be re-tweets.
tweets = client.search_recent_tweets(query=query, max_results=100) # search_recent_tweets searches tweets from the last 7 days; we request up to 100 of them.
for tweet in (tweets.data or [])[:10]: # tweets.data is None when nothing matches; print the first 10 tweets.
    print(tweet.text)
    print('-----------------------------')

Please note that the Twitter API limits how many requests a user can send every 15 minutes. So please don't run the code many times in quick succession, and don't wrap the above code in a for/while loop.
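
To make the limit concrete, here is a minimal sketch of a request budget for a 15-minute window. The class name `RequestBudget` and the limit of 3 calls in the usage lines are made up for illustration; the real limit depends on the endpoint and on your access level. tweepy's `Client` also accepts `wait_on_rate_limit=True`, which asks it to wait for the window to reset instead of raising an error.

```python
import time
from collections import deque

class RequestBudget:
    """Track request times and report whether another call fits in the window."""

    def __init__(self, max_calls, window_seconds=15 * 60, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock      # injectable so the class can be tried without waiting
        self.calls = deque()    # timestamps of the calls inside the current window

    def try_acquire(self):
        """Record a call and return True if the budget allows it, else return False."""
        now = self.clock()
        # Forget calls that have fallen out of the window.
        while self.calls and now - self.calls[0] >= self.window_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

budget = RequestBudget(max_calls=3)   # 3 is an arbitrary example value
print([budget.try_acquire() for _ in range(4)])
```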

If you can see some tweets printed out, your scraping tool **tweepy** is set up successfully.

Before moving on, please read the following links to learn how the above function works and how to write a specific query.

[query with Twitter API](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query)

[search_recent_tweets](https://docs.tweepy.org/en/latest/client.html#tweepy.Client.search_recent_tweets)
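
As a small illustration of how a query is put together, the sketch below joins a keyword with the two operators used in Step 3 (`lang:` and `-is:retweet`). The helper `build_query` and its parameter names are hypothetical and exist only for this example; the query documentation linked above lists many more operators.

```python
def build_query(keyword, lang=None, exclude_retweets=False):
    """Join a keyword with optional search operators into one query string."""
    parts = [keyword]
    if lang:
        parts.append("lang:{}".format(lang))   # restrict to one language
    if exclude_retweets:
        parts.append("-is:retweet")            # the leading '-' negates the operator
    return " ".join(parts)

print(build_query("football", lang="en", exclude_retweets=True))
# prints: football lang:en -is:retweet
```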

A single call to the function returns at most 100 tweets (`max_results` must be between 10 and 100). To scrape more data, we can use pagination.
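
The idea behind pagination can be sketched without touching the network. In the sketch below, `fake_search` is a made-up stand-in for one API call that returns a page of results plus a token for the next page, and `collect` keeps requesting pages until the limit is reached or no token is left. Neither name belongs to tweepy.

```python
def fake_search(next_token=None, max_results=100):
    """Hypothetical stand-in for one API call over 250 pretend tweets."""
    start = next_token or 0
    end = min(start + max_results, 250)
    page = ["tweet {}".format(i) for i in range(start, end)]
    token = end if end < 250 else None   # None means there is no further page
    return page, token

def collect(search, limit, page_size=100):
    """Request pages one after another until limit items are gathered."""
    items, token = [], None
    while len(items) < limit:
        page, token = search(next_token=token, max_results=page_size)
        items.extend(page)
        if token is None:
            break
    return items[:limit]

print(len(collect(fake_search, limit=220)))   # prints 220
```

`tweepy.Paginator` takes care of this bookkeeping for you, and `flatten(limit=...)` plays the role of the `limit` argument here.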

In [None]:
query = 'football -is:retweet'
tweets = list(tweepy.Paginator(client.search_recent_tweets, query=query, tweet_fields=['context_annotations', 'created_at'], max_results=100).flatten(limit=1000))
print("{} tweets are collected.".format(len(tweets)))

For more features of the scraping tool, please refer to the [article](https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9). Remember that some features are unavailable to your account (Availability: Essential).

Lastly, let's save our scraped tweets to a file.

If you are using Google Colab, first mount your Google Drive with the icon on the left-hand side.

Then, locate the path to save to by right-clicking the target folder -> copy path.

Replace the value of 'driveFolderDirectory' below with your copied path.

In [None]:
import csv

driveFolderDirectory = '/content/drive/MyDrive/Colab Notebooks/' # if you are not using Google Colab, edit the value directly here.
savedFileName = 'tweets.csv'
pathToSave = driveFolderDirectory + savedFileName

# Write as UTF-8 so emojis and non-English characters in tweets are saved correctly on any platform.
with open(pathToSave, 'w', newline='', encoding='utf-8') as csvfile:
  fieldnames = ['idx','tweetId', 'tweetText']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i,tweet in enumerate(tweets):
    writer.writerow({'idx': i, 'tweetId': tweet.id,'tweetText': tweet.data['text']})


Now, let's test whether you know what you are doing. Get 10000 tweets that mention 'nft' and are **written in English**. Each tweet should **not be a re-tweet** and should contain **no link**. Save them in a .csv file with their **tweet id**, **created time**, and **tweet text**.

In [None]:
# IMPLEMENT YOUR CODE  HERE #
pass

Next, we are going to learn how to scrape data from a Wikipedia page:

https://en.wikipedia.org/wiki/Non-fungible_token

with requests and BeautifulSoup.



In [3]:
# First, let's install the required libraries, i.e. requests and BeautifulSoup
!pip install requests
# BeautifulSoup should already be installed in Google Colab; if not, please run: !pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Since you already have some experience in scraping data, this time I will leave some useful reading materials and skeleton code for you to fill in so that you can scrape the following data:

First, scrape the contents of this Wikipedia article: https://en.wikipedia.org/wiki/Pok%C3%A9mon. By "contents", we refer only to the text that explains the subject named in the page title, i.e. the "article" part of the webpage. See https://en.wikipedia.org/wiki/Wikipedia:What_is_an_article%3F for what counts as an article in Wikipedia. (We do not require you to scrape the table(s) on the page at this point, but you should think about how to scrape them as well, and how they could benefit training your language model later on.)

Second, scrape the contents of all Wikipedia articles that are linked from the content of this page only, i.e., you don't need to scrape the sidebar. You will have to look at the retrieved HTML of the first page and find the pattern you can use to obtain the links from this article's content to other Wikipedia articles.
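
One pattern that is easy to check by hand is the shape of the `href` values: links to other articles are internal paths that start with `/wiki/`. The sketch below filters a list of hrefs on that basis. The helper names are hypothetical, and the colon rule is a simplifying assumption: it drops namespaced pages such as `File:` or `Help:`, but it also drops the few articles whose titles contain a colon.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org"

def is_article_link(href):
    """True for internal article paths such as /wiki/Pikachu."""
    # Namespaced pages (File:, Help:, Category:, ...) contain a colon and are skipped.
    return href.startswith("/wiki/") and ":" not in href

def article_urls(hrefs):
    """Turn the article-like hrefs into absolute URLs, dropping duplicates."""
    seen, urls = set(), []
    for href in hrefs:
        if is_article_link(href):
            url = urljoin(BASE_URL, href.split("#")[0])   # drop any section anchor
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

sample = ["/wiki/Pikachu", "#cite_note-1", "/wiki/File:Logo.svg",
          "https://example.com/", "/wiki/Pikachu#History"]
print(article_urls(sample))
```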

**Useful Materials:**

To get started, please read through the following materials to get an idea of how to scrape data from a webpage.

If you are not familiar with HTML files or their format, please quickly go through the tutorial at [W3Schools](https://www.w3schools.com/html/).

An [introduction](https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/) to scraping and parsing information from a webpage.

[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (helpful when you want to know how to use certain functions in BeautifulSoup)

Please remember to set a delay (>=1s) between requests when scraping webpages, so that you do not overload the Wikipedia servers and get banned for scraping too fast.

In [None]:
#First, import the needed libraries for scraping data from Wikipedia webpages.

import requests
from bs4 import BeautifulSoup
import time # for setting up a delay between requests to the wiki server.
from tqdm import tqdm

# First, get the page from the wiki server given a URL.
def getPageFromWiki(url):
    # get URL
    page = requests.get(url)
 
    # scrape webpage
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

# Second, get the title of the wiki page
def getHeading(soup):
    heading = soup.find('title').text
    return heading

# Third, get the article part of the wiki page 
def getContent(page):
    content = []
    texts = page.find_all('p')
    for text in texts:
        content.append(text.get_text())
    return content

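# Fourth, get the links that the article part mentions, specifically those linking to other wiki pages.
def getLinks(page):
    linksDict = {}

    # Restrict the search to the article body so that sidebar and navigation links are skipped.
    body = page.find('div', id='mw-content-text') or page
    links = body.find_all('a', href = True, title = True)
    for link in links:
        href = link.get('href')
        # Keep internal article links such as /wiki/Pikachu; skip anchors, external links and
        # namespaced pages (File:, Help:, ...). Note this also skips article titles containing ':'.
        if href.startswith('/wiki/') and ':' not in href:
            linksDict[link.get('title')] = 'https://en.wikipedia.org' + href

    return linksDict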

In [None]:
# Once you've implemented the above functions, run the following piece to check that it builds a dictionary containing the wiki articles we scraped.
# A for loop then gets all the linked article contents from Wikipedia.
pageDict = {}

page = getPageFromWiki('https://en.wikipedia.org/wiki/Pok%C3%A9mon') # scrape the main page we want.
header = getHeading(page)
content = getContent(page)
pageDict[header] = content
print(pageDict)

linksDict = getLinks(page) # get the links contained in the article part of the page.
print("a set of {} links are found.".format(len(linksDict)))

for title in tqdm(list(linksDict.keys())): # loop over the linked pages, with a delay at each iteration
  url = linksDict[title]
  page = getPageFromWiki(url)
  header = getHeading(page)
  content = getContent(page)
  pageDict[header] = content
  time.sleep(1) # Remember to set a delay >=1 second so you won't break the server.

print("a size of {} content dictionary is built.".format(len(pageDict)))

In [None]:
# Lastly, save your contents and the corresponding titles in a .csv file.
import csv

driveFolderDirectory = '/content/drive/MyDrive/Colab Notebooks/' # if you are not using Google Colab, edit the value directly here.
savedFileName = 'wikiContents.csv'
pathToSave = driveFolderDirectory + savedFileName

# Write as UTF-8 so non-ASCII characters (e.g. the é in Pokémon) are saved correctly on any platform.
with open(pathToSave, 'w', newline='', encoding='utf-8') as csvfile:
  fieldnames = ['idx','wikiTitle', 'wikiContents']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i,wikiContentKey in enumerate(pageDict.keys()):
    # Each value is a list of paragraphs; join them so the cell holds plain text rather than a list repr.
    writer.writerow({'idx': i, 'wikiTitle': wikiContentKey,'wikiContents': ' '.join(pageDict[wikiContentKey])})

Finally, let's use what we learned from scraping Wikipedia pages to scrape news from news websites, i.e. ABC News and Fox News.

In this assignment, you need to get 100 news articles from each news source. The first thing to figure out is where we can get the 100 news URLs. If you've read the [article](https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/) from the Wikipedia section, you will know that reading the robots.txt of each news website is very helpful. Take a look at the robots.txt of both ABC News and Fox News. You will see that each lists a sitemap, which contains news URLs in XML format. You can scrape these sitemaps with BeautifulSoup.
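
To see what a robots.txt file gives you, here is a minimal sketch that uses Python's standard `urllib.robotparser` on a made-up file. The contents of `robots_txt` and the `example.com` URLs are assumptions for illustration only; real files are served at the site's `/robots.txt`. The `site_maps()` method requires Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.site_maps())   # sitemap URLs listed in the file, or None if there are none
print(parser.can_fetch("*", "https://example.com/news/story"))     # allowed by the rules above
print(parser.can_fetch("*", "https://example.com/private/page"))   # blocked by the Disallow rule
```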

In [4]:
def getPageFromWiki(url):
    # get URL
    page = requests.get(url)
 
    # scrape webpage
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup


In [6]:
import requests
from bs4 import BeautifulSoup

# def getPageFromWiki(url): # you may use the getPageFromWiki you implemented from the wikipedia section

abcNewsSitemap = getPageFromWiki('https://abcnews.go.com/xmlLatestStories')

# A similar one can be found in the Fox news robots.txt, here we will leave it to you to find out what the sitemap URL is.
foxNewsSitemap = getPageFromWiki('https://www.foxbusiness.com/sitemap.xml?type=news') # FILL THE SITEMAP URL HERE.


In [11]:
import urllib.request
from urllib.parse import urlparse
from bs4 import BeautifulSoup

In [71]:
import urllib.request
import xml.etree.ElementTree as et

with urllib.request.urlopen('https://www.foxbusiness.com/sitemap.xml?type=news') as url:
    data = url.read()

xml = et.fromstring(data)
nsmp = {"doc": "http://www.sitemaps.org/schemas/sitemap/0.9"}
       
foxUrlList = [] 

for url in xml.findall('doc:url', namespaces = nsmp):
   loc = url.find('doc:loc', namespaces = nsmp).text
  
   foxUrlList.append(loc)

In [78]:
with urllib.request.urlopen('https://abcnews.go.com/xmlLatestStories') as url:
    data = url.read()

xml = et.fromstring(data)
nsmp = {"doc": "http://www.sitemaps.org/schemas/sitemap/0.9"}
       
abcUrlList = [] 

for url in xml.findall('doc:url', namespaces = nsmp):
   loc = url.find('doc:loc', namespaces = nsmp).text
  
   abcUrlList.append(loc)

In [72]:
print("foxUrlList",foxUrlList[:100])



In [20]:
# The next step is to get the article URLs. You could do so with the 'select' function from BeautifulSoup, but since the sitemap is an .xml file,
# getting them is a little different from the Wikipedia case. Regardless, the idea is similar and it won't be too hard for you to find out how to do so.

# Try getting 100 news URLs from the sitemap.

def getUrlList(sitemap, limit=100):
  # This function returns a list of the news URLs contained in the parsed sitemap page.
  # 'sitemap' is the BeautifulSoup object returned by getPageFromWiki.
  url_list = []
  # In a sitemap, each article URL is stored in a <loc> element.
  for loc in sitemap.find_all('loc'):
    url_list.append(loc.get_text(strip=True))
    if len(url_list) >= limit: # stop once we have enough URLs
      break

  return url_list


In [None]:
foxUrlList = []
abcUrlList = []

foxUrlList = getUrlList(foxNewsSitemap)
abcUrlList = getUrlList(abcNewsSitemap)

# Test here if the list contains the URLs you want.
print("foxUrlList",foxUrlList)
print("abcUrlList",abcUrlList)


Once we have the URL list, we can start extracting the 'article' part of the news from these webpages. One way to do so is to mimic the approach from the Wikipedia section. Here we introduce a 'shortcut': a library called 'newspaper' that helps us with that.

Please have a brief look at its GitHub repository; it's very easy to use. We leave the task of scraping the article part of the news from each URL using 'newspaper' to you, with a template below.

The ABC/Fox sitemap page might contain fewer than 100 news articles. Try refreshing the sitemap to get different ones until 100 articles are collected for each news source, or try a different sitemap.
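
If you fetch a sitemap more than once, the batches will overlap. A minimal sketch of combining them is shown below: it keeps URLs in order, skips duplicates, and stops at the target count. The helper `merge_urls` and the `example.com` URLs are made up for illustration.

```python
def merge_urls(batches, target=100):
    """Combine URL batches in order, skipping duplicates, until target is reached."""
    seen, merged = set(), []
    for batch in batches:
        for url in batch:
            if url not in seen:
                seen.add(url)
                merged.append(url)
                if len(merged) == target:
                    return merged
    return merged   # fewer than target if the batches run out

# Two made-up batches, as if the sitemap had been fetched twice.
first = ["https://example.com/a", "https://example.com/b"]
second = ["https://example.com/b", "https://example.com/c"]
print(merge_urls([first, second], target=3))
```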

Again, you may want to set up a delay every time you make a request to the news server.

In [38]:
#install newspaper
!pip install newspaper3k


In [43]:
from newspaper import Article
from tqdm import tqdm

#def getNewsDict(url_list):

  # Your key should be the news title and value should be the article text of the news.
  #newsDict = {}
  # IMPLEMENT YOUR CODE HERE:# 
  
  #return newsDict


#abcNews = getNewsDict(abcUrlList)
#foxNews = getNewsDict(foxUrlList)


In [56]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [62]:
import time

abcNews = {} # key: news title, value: article text
for url in abcUrlList:
  article = Article(url, language="en") # en for English
  try:
    article.download()
    article.parse()
  except Exception as e: # skip articles that fail to download or parse
    print("skipped {}: {}".format(url, e))
    continue
  abcNews[article.title] = article.text
  time.sleep(1) # delay between requests to the news server
  

In [74]:
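# Drop a URL that was excluded in the original run; the guard avoids a ValueError if the URL is not in the list.
excludedUrl = "https://www.foxbusiness.com/politics/us-steps-away-flagship-lithium-project-berkshire"
if excludedUrl in foxUrlList:
  foxUrlList.remove(excludedUrl)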

In [75]:
import time

foxNews = {} # key: news title, value: article text
for url in foxUrlList:
  article = Article(url, language="en") # en for English
  try:
    article.download()
    article.parse()
  except Exception as e: # skip articles that fail to download or parse
    print("skipped {}: {}".format(url, e))
    continue
  foxNews[article.title] = article.text
  time.sleep(1) # delay between requests to the news server

In [77]:
len(foxNews)

95

In [64]:
len(abcNews)

55

In [79]:
# Lastly, write both the ABC and Fox news to a .csv file.

import csv

driveFolderDirectory = '/content/drive/MyDrive/Colab Notebooks/' # if you are not using Google Colab, edit the value directly here.
savedFileName = 'newsContents.csv'
pathToSave = driveFolderDirectory + savedFileName

# size check
#assert len(abcNews)>=100 and len(foxNews)>=100, "the size of both news dictionary should be no less than 100. got {} for abc news and {} for fox news instead.".format(len(abcNews),len(foxNews))

with open(pathToSave, 'w', newline='') as csvfile:
  fieldnames = ['idx','newsSource','newsTitle','newsContents']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i,newsDictKey in enumerate(abcNews.keys()):
    writer.writerow({'idx': i,'newsSource':'ABCNews', 'newsTitle': newsDictKey,'newsContents': abcNews[newsDictKey]})
  for i,newsDictKey in enumerate(foxNews.keys()):
    writer.writerow({'idx': i,'newsSource':'FoxNews', 'newsTitle': newsDictKey,'newsContents': foxNews[newsDictKey]})