![Image](https://drive.google.com/uc?export=view&id=10B8NecPfn9sXRescmijQ8Zc2CO08fQm7)

## **Data Collection via Web Scraping & API**
### ACC Tech Challenge Series, Spring 2020
### Harper Xiang


# JSON API

“JSON (JavaScript Object Notation) is a lightweight data-interchange format. … JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

JSON is built on two structures:

- A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
- An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.”

(https://www.json.org)
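
The two structures map directly onto Python types. The following is a minimal sketch using only the standard `json` module; the sample text is abbreviated from the SpaceX launchpad data used later in this notebook.

```python
import json

# A JSON object (name/value pairs) that contains a JSON array (ordered values).
text = '{"name": "VAFB SLC 3W", "vehicles_launched": ["Falcon 1"], "padid": 5}'

# JSON text -> Python objects: object becomes dict, array becomes list.
parsed = json.loads(text)
print(type(parsed).__name__)                       # dict
print(type(parsed["vehicles_launched"]).__name__)  # list

# Python objects -> JSON text again.
round_trip = json.dumps(parsed, sort_keys=True)
print(round_trip)
```

`json.loads` and `json.dumps` work on strings; `response.json()` in the cells below performs the same parsing on an HTTP response body.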

In [1]:
# The requests library makes it easy to send HTTP/1.1 requests
import requests
import json

In [2]:
# Find a URL
# https://github.com/r-spacex/SpaceX-API
url = "https://api.spacexdata.com/v2/launchpads"

In [4]:
# Send a GET request to the URL and keep the server's response
response = requests.get(url)
response

<Response [200]>
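
The `200` in `<Response [200]>` is the HTTP status code, also available as `response.status_code`. As a rough sketch, the standard library's `http.HTTPStatus` can turn a code into a readable label; `describe_status` is a hypothetical helper name, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for a standard HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (not a success)
```

In a script, calling `response.raise_for_status()` right after `requests.get(url)` raises an exception for 4xx and 5xx responses, so failures are not silently parsed as data.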

In [5]:
# Parse the response body as JSON, then pretty-print it
data = response.json()
print(json.dumps(data, indent=4, sort_keys=True))

[
    {
        "attempted_launches": 0,
        "details": "SpaceX original west coast launch pad for Falcon 1. Performed a static fire but was never used for a launch and abandoned due to scheduling conflicts.",
        "full_name": "Vandenberg Air Force Base Space Launch Complex 3W",
        "id": "vafb_slc_3w",
        "location": {
            "latitude": 34.6440904,
            "longitude": -120.5931438,
            "name": "Vandenberg Air Force Base",
            "region": "California"
        },
        "name": "VAFB SLC 3W",
        "padid": 5,
        "status": "retired",
        "successful_launches": 0,
        "vehicles_launched": [
            "Falcon 1"
        ],
        "wikipedia": "https://en.wikipedia.org/wiki/Vandenberg_AFB_Space_Launch_Complex_3"
    },
    {
        "attempted_launches": 58,
        "details": "SpaceX primary Falcon 9 launch pad, where all east coast Falcon 9s launched prior to the AMOS-6 anomaly. Initially used to launch Titan rockets for Lockhe

In [6]:
len(data)

6

In [7]:
# Getting the value of a Field (“Key”)
data[0]['name']

'VAFB SLC 3W'
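
Nested fields are reached by chaining keys and indexes. The sketch below uses one record abbreviated from the launchpad output above, so it runs without a network connection; `data_sample` stands in for the full `data` list, and the `website` key is a made-up example of a missing field.

```python
# One launchpad record, abbreviated from the API output shown above.
pad = {
    "id": "vafb_slc_3w",
    "name": "VAFB SLC 3W",
    "status": "retired",
    "location": {
        "latitude": 34.6440904,
        "longitude": -120.5931438,
        "name": "Vandenberg Air Force Base",
        "region": "California",
    },
    "vehicles_launched": ["Falcon 1"],
}
data_sample = [pad]

# Chain keys to reach a value inside a nested object.
region = data_sample[0]["location"]["region"]

# Index into a nested array.
first_vehicle = data_sample[0]["vehicles_launched"][0]

# Collect one field from every record in the list.
names = [p["name"] for p in data_sample]

# dict.get returns a default instead of raising KeyError for a missing key.
website = data_sample[0].get("website", "n/a")

print(region, first_vehicle, names, website)
```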

# Social Media Mining

Apply for Twitter API access:

https://developer.twitter.com/en/docs

Tweepy Guideline:

https://tweepy.readthedocs.io/en/v3.5.0/

http://docs.tweepy.org/en/v3.6.0/api.html

Install the Twitter API wrapper library:

$ pip install tweepy

In [8]:
#!pip install tweepy
import tweepy

In [9]:
# Import Twitter API Keys
from config import consumer_key, consumer_secret, access_token, access_token_secret
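
The cell above reads the four credentials from a local `config.py`, which is not shown here and should stay out of version control. One possible alternative, sketched below, reads them from environment variables instead; the variable names and the `load_twitter_keys` helper are assumptions made for illustration.

```python
import os

# Hypothetical environment variable names; choose whatever naming you prefer.
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_twitter_keys(env=os.environ):
    """Return the four credentials as a tuple, or raise if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in REQUIRED)
```

With the variables set in the shell, `consumer_key, consumer_secret, access_token, access_token_secret = load_twitter_keys()` would take the place of the `config` import.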

In [11]:
# Authenticate with the API using the keys imported from config
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

### Get Tweets from the authenticated User’s home timeline

In [12]:
# Includes the user's own tweets and the tweets of the accounts they follow
public_tweets = api.home_timeline()
public_tweets

[Status(_api=<tweepy.api.API object at 0x7fef9f7e4d50>, _json={'created_at': 'Thu Oct 22 21:18:10 +0000 2020', 'id': 1319387597362388992, 'id_str': '1319387597362388992', 'text': "The Practitioner's Guide to Graph Data — Applying Graph Thinking and Graph Technologies to Solve Complex Problems:… https://t.co/jANgkfHhtD", 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/jANgkfHhtD', 'expanded_url': 'https://twitter.com/i/web/status/1319387597362388992', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [116, 139]}]}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 534563976, 'id_str': '534563976', 'name': 'Kirk Borne', 'screen_name': 'KirkDBorne', 'location': 'Maryland, USA', 'description': 'Prin

### Get a certain User’s tweets

In [13]:
username = "@billgates"

# `count` sets how many tweets to fetch; `page` (commented out) selects a page of results
public_tweets = api.user_timeline(username, count=10)
#public_tweets = api.user_timeline(username, page=x)
public_tweets

[Status(_api=<tweepy.api.API object at 0x7fef9f7e4d50>, _json={'created_at': 'Tue Oct 20 18:30:42 +0000 2020', 'id': 1318620676962480129, 'id_str': '1318620676962480129', 'text': 'RT @StephenCurry30: Even with his busy schedule, Dr. Fauci took the time to sit down with me (AGAIN) and talk about what we’ve gotten right…', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'StephenCurry30', 'name': 'Stephen Curry', 'id': 42562446, 'id_str': '42562446', 'indices': [3, 18]}], 'urls': []}, 'source': '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 50393960, 'id_str': '50393960', 'name': 'Bill Gates', 'screen_name': 'BillGates', 'location': 'Seattle, WA', 'description': "Sharing things I'm learning through my foundation work and other interests.",

In [12]:
user_account = api.get_user("@billgates")
user_account

User(_api=<tweepy.api.API object at 0x7fa488f972d0>, _json={'id': 50393960, 'id_str': '50393960', 'name': 'Bill Gates', 'screen_name': 'BillGates', 'location': 'Seattle, WA', 'profile_location': {'id': '300bcc6e23a88361', 'url': 'https://api.twitter.com/1.1/geo/id/300bcc6e23a88361.json', 'place_type': 'unknown', 'name': 'Seattle, WA', 'full_name': 'Seattle, WA', 'country_code': '', 'country': '', 'contained_within': [], 'bounding_box': None, 'attributes': {}}, 'description': "Sharing things I'm learning through my foundation work and other interests.", 'url': 'https://t.co/emd1hfqSRD', 'entities': {'url': {'urls': [{'url': 'https://t.co/emd1hfqSRD', 'expanded_url': 'https://gatesnot.es/blog', 'display_url': 'gatesnot.es/blog', 'indices': [0, 23]}]}, 'description': {'urls': []}}, 'protected': False, 'followers_count': 51922557, 'friends_count': 237, 'listed_count': 119746, 'created_at': 'Wed Jun 24 18:44:10 +0000 2009', 'favourites_count': 141, 'utc_offset': None, 'time_zone': None, 'ge

### Get a User's info

In [14]:
target_user = username
try:
    user_account = api.get_user(target_user)

    # Read the fields of interest from the User object
    user_real_name  = user_account.name
    user_tweets_num = user_account.statuses_count
    user_followers  = user_account.followers_count
    user_friends    = user_account.friends_count
    user_favorites  = user_account.favourites_count
    user_language   = user_account.lang   # e.g. "en"

except tweepy.TweepError as e:
    print(f"exception for {target_user}: {e}")

In [15]:
print(f'{user_real_name} has {user_tweets_num} tweets, {user_followers} followers, and {user_friends} friends.')

Bill Gates has 3401 tweets, 52348015 followers, and 240 friends.


## Ready Functions

In [16]:
### FUNCTION: Get Twitter User Info ###
import datetime

def get_user_info(user_account):

    # Check user_name format: the account name must start with '@'
    if user_account[0] != '@':
        return None
        
    # Get user info from the Json object
    try:
        obj_user = api.get_user(user_account)
        # Get the specific column data
        user_id                = obj_user.id_str
        user_name              = obj_user.name
        user_screen_name       = obj_user.screen_name
        user_verified          = obj_user.verified
        user_created_at        = obj_user.created_at
        user_location          = obj_user.location
        user_followers_count   = obj_user.followers_count
        user_friends_count     = obj_user.friends_count
        user_listed_count      = obj_user.listed_count
        user_favourites_count  = obj_user.favourites_count
        user_statuses_count    = obj_user.statuses_count
        user_description       = obj_user.description

    except tweepy.TweepError as e:
        print(f"exception for {user_account}: {e}")
        # Without this return, the variables used below would be undefined
        return None

    # Date/Time Transformation
    #converted_time   = datetime.datetime.strptime(user_created_at, "%a %b %d %H:%M:%S %z %Y")
    #user_create_time = converted_time.strftime("%Y-%m-%d %H:%M:%S")
        
    # Return the user info needed
    user_info = {
        "user_id"                : user_id,
        "user_account"           : user_account,
        "user_name"              : user_name,
        "user_screen_name"       : user_screen_name,
        "user_verified"          : user_verified,
        #"user_created_at"        : user_create_time,
        "user_location"          : user_location,
        "user_followers_count"   : user_followers_count,
        "user_friends_count"     : user_friends_count,
        "user_listed_count"      : user_listed_count,
        "user_favourites_count"  : user_favourites_count,
        "user_statuses_count"    : user_statuses_count,
        "user_description"       : user_description
    }
    return user_info

In [18]:
user_info = get_user_info(username)
user_info

{'user_id': '50393960',
 'user_account': '@billgates',
 'user_name': 'Bill Gates',
 'user_screen_name': 'BillGates',
 'user_verified': True,
 'user_location': 'Seattle, WA',
 'user_followers_count': 52348028,
 'user_friends_count': 240,
 'user_listed_count': 119720,
 'user_favourites_count': 142,
 'user_statuses_count': 3401,
 'user_description': "Sharing things I'm learning through my foundation work and other interests."}

In [19]:
### FUNCTION: Get Twitter User Tweets ###
import pandas as pd

def get_user_tweets(user_account, count_tweets=200):
    
    # Check user_name format and the requested number of tweets
    if user_account[0] != '@' or not (1 <= count_tweets <= 1000):
        return None
    
    # Get the tweets by the user
    user_tweets = api.user_timeline(user_account, count=count_tweets)
    #print(len(user_tweets))
    
    # Prepare for the dataframe
    tweets_id             = []
    tweets_user_id        = []
    tweets_user_account   = []
    tweets_created_at     = []
    tweets_longitude      = []
    tweets_latitude       = []
    tweets_quote_count    = []
    tweets_reply_count    = []
    tweets_retweet_count  = []
    tweets_favorite_count = []
    tweets_lang           = []
    tweets_text           = []
    
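    # For each tweet
    for status in user_tweets:
        # Tweepy returns Status objects, which do not support tweet["key"] access;
        # the raw JSON dict is available as status._json
        tweet = status._json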
        tweet_id                        = tweet["id_str"]
        tweet_user_id                   = tweet["user"]["id_str"]
        tweet_user_acount               = user_account
        tweet_created_at                = tweet["created_at"]
        #tweet_quote_count               = tweet["quote_count"]
        tweet_quote_count               = ""
        #tweet_reply_count               = tweet["reply_count"]
        tweet_reply_count               = ""
        tweet_retweet_count             = tweet["retweet_count"]
        tweet_favorite_count            = tweet["favorite_count"]
        tweet_lang                      = tweet["lang"]
        tweet_text                      = tweet["text"]

        #print(json.dumps(tweet["coordinates"], indent=4, sort_keys=True))
        if tweet["coordinates"]:
            tweet_longitude = tweet["coordinates"]["coordinates"][0]
            tweet_latitude  = tweet["coordinates"]["coordinates"][1]
        else:
            tweet_longitude = ""
            tweet_latitude  = ""
        
        # Date/Time Transformation
        converted_time = datetime.datetime.strptime(tweet_created_at, "%a %b %d %H:%M:%S %z %Y")
        tweet_time     = converted_time.strftime("%Y-%m-%d %H:%M:%S")

        # Add the info of the tweet into the dataframe
        tweets_id.append(tweet_id)
        tweets_user_id.append(tweet_user_id)
        tweets_user_account.append(tweet_user_acount)
        tweets_created_at.append(tweet_time)
        tweets_longitude.append(tweet_longitude)
        tweets_latitude.append(tweet_latitude)
        tweets_quote_count.append(tweet_quote_count)
        tweets_reply_count.append(tweet_reply_count)
        tweets_retweet_count.append(tweet_retweet_count)
        tweets_favorite_count.append(tweet_favorite_count)
        tweets_lang.append(tweet_lang)
        tweets_text.append(tweet_text)

    # Return: tweets info
    df_tweets_info = pd.DataFrame({
        "twitter_account"    : tweets_user_account,
        "tweet_id"           : tweets_id,
        "user_id"            : tweets_user_id,
        "created_at"         : tweets_created_at,
        "quote_count"        : tweets_quote_count,
        "reply_count"        : tweets_reply_count,
        "retweet_count"      : tweets_retweet_count,
        "favorite_count"     : tweets_favorite_count,
        "lang"               : tweets_lang,
        "longitude"          : tweets_longitude,
        "latitude"           : tweets_latitude,
        "text"               : tweets_text
    })
    return df_tweets_info
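
An alternative to maintaining twelve parallel lists is to build one flat dictionary per tweet and collect those. The sketch below assumes each tweet is a plain dict shaped like the `_json` payload shown in the outputs above; `tweet_to_row` is a hypothetical helper, and the counts and `lang` value in the sample are made up for illustration.

```python
from datetime import datetime

# Format of Twitter's created_at strings, e.g. "Thu Oct 22 21:18:10 +0000 2020".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweet_to_row(tweet, user_account):
    """Flatten one tweet dict (for example status._json) into a flat row dict."""
    coords = tweet.get("coordinates")
    # GeoJSON order is [longitude, latitude]; use empty strings when absent.
    longitude, latitude = coords["coordinates"] if coords else ("", "")
    created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
    return {
        "twitter_account": user_account,
        "tweet_id": tweet["id_str"],
        "user_id": tweet["user"]["id_str"],
        "created_at": created.strftime("%Y-%m-%d %H:%M:%S"),
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
        "lang": tweet["lang"],
        "longitude": longitude,
        "latitude": latitude,
        "text": tweet["text"],
    }

# Abbreviated sample modeled on the home-timeline output shown earlier.
sample = {
    "created_at": "Thu Oct 22 21:18:10 +0000 2020",
    "id_str": "1319387597362388992",
    "user": {"id_str": "534563976"},
    "retweet_count": 0,    # illustrative value
    "favorite_count": 0,   # illustrative value
    "lang": "en",          # illustrative value
    "coordinates": None,
    "text": "The Practitioner's Guide to Graph Data",
}

row = tweet_to_row(sample, "@KirkDBorne")
print(row["created_at"])  # 2020-10-22 21:18:10
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`, which uses the dictionary keys as column names.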

# Web Scraping

In [20]:
from bs4 import BeautifulSoup as bs

In [21]:
html_string = """
<html>
<head>
<title>
A Simple HTML Document
</title>
</head>
<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>
</html>
"""

In [22]:
# Parse the HTML string
soup = bs(html_string, 'html.parser')
type(soup)

bs4.BeautifulSoup

In [46]:
# Print formatted version of the soup
print(soup.prettify())

<html>
 <head>
  <title>
   A Simple HTML Document
  </title>
 </head>
 <body>
  <p>
   This is a very simple HTML document
  </p>
  <p>
   It only has two paragraphs
  </p>
 </body>
</html>

In [24]:
# Extract the title of the HTML document
soup.title

<title>
A Simple HTML Document
</title>

In [25]:
# Extract the contents of the HTML body
soup.body

<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>

In [27]:
# Text of the first paragraph
soup.body.p.text

'This is a very simple HTML document'

In [28]:
# Extract all paragraph elements
soup.body.find_all('p')

[<p>This is a very simple HTML document</p>, <p>It only has two paragraphs</p>]

In [29]:
# Extract paragraph by index
soup.body.find_all('p')[0]

<p>This is a very simple HTML document</p>

## Craigslist web scraping

In [30]:
# URL of page to be scraped
url = 'https://newjersey.craigslist.org/search/sss?sort=rel&query=guitar'
# Retrieve page with the requests module
response = requests.get(url)

In [31]:
# Create BeautifulSoup object; parse with 'html.parser'
soup = bs(response.text, 'html.parser')
# Examine the results, then determine element that contains sought info
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <title>
   north jersey for sale "guitar"  - craigslist
  </title>
  <script id="ld_breadcrumb_data" type="application/ld+json">
   {"@context":"https://schema.org","itemListElement":[{"item":{"name":"newjersey.craigslist.org","@id":"https://newjersey.craigslist.org"},"position":1,"@type":"ListItem"},{"item":{"name":"for sale","@id":"https://newjersey.craigslist.org/search/sss"},"position":2,"@type":"ListItem"}],"@type":"BreadcrumbList"}
  </script>
  <meta content='north jersey for sale "guitar"  - craigslist' name="description"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible">
   <link href="https://newjersey.craigslist.org/search/sss?query=guitar&amp;sort=rel" rel="canonical"/>
   <link href="https://newjersey.craigslist.org/search/sss?s=120&amp;query=guitar&amp;sort=rel" rel="next"/>
   <meta content="width=device-width,initial-scale=1" name="viewport"/>
   <link href="//www.craigslist.org/styles/cl.css?v=9100bd643974b9e54643

In [32]:
# find_all returns a list-like collection of every matching <li> element
results = soup.find_all('li', class_="result-row")

In [33]:
# Loop through returned results
for result in results:
    # Error handling
    try:
        # Identify and return title of listing
        title = result.find('a', class_="result-title").text
        # Identify and return price of listing
        price = result.a.span.text
        # Identify and return link to listing
        link = result.a['href']

        # Print results only if title, price, and link are available
        if (title and price and link):
            print('-------------')
            print(title)
            print(price)
            print(link)
    except AttributeError as e:
        print(e)

-------------
Firefly FFDCS Guitar
$200
https://newjersey.craigslist.org/msg/d/montville-firefly-ffdcs-guitar/7212588397.html
-------------
Behringer Strat style guitar
$125
https://newjersey.craigslist.org/msg/d/west-milford-behringer-strat-style/7206563782.html
'NoneType' object has no attribute 'text'
'NoneType' object has no attribute 'text'
-------------
Unbranded 1960's  Electric Guitar Korean
$225
https://newjersey.craigslist.org/msg/d/west-milford-unbranded-1960s-electric/7206563186.html
-------------
Canvas CVF L.P. style guitar
$150
https://newjersey.craigslist.org/msg/d/hamburg-canvas-cvf-lp-style-guitar/7218528947.html
-------------
2003 Squier Showmaster guitar
$150
https://newjersey.craigslist.org/msg/d/hamburg-2003-squier-showmaster-guitar/7218529284.html
-------------
De Armond Jet Star guitar
$350
https://newjersey.craigslist.org/msg/d/hamburg-de-armond-jet-star-guitar/7218525343.html
-------------
DV Mark DVC Guitar Friend 12 Combo Amp
$200
https://newjersey.craigslis
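
The scraped prices are strings such as `$200`, so they need converting before they can be compared or summed. The following is a minimal sketch using only the standard library; `parse_price` is a hypothetical helper that assumes whole-dollar prices (any cents after a decimal point are dropped), and the sample listings are copied from the output above.

```python
import re

def parse_price(text):
    """Convert a price string such as '$1,200' to an int; None if no number is found."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

# (title, price) pairs taken from the scraped output above.
listings = [
    ("Firefly FFDCS Guitar", "$200"),
    ("Behringer Strat style guitar", "$125"),
    ("De Armond Jet Star guitar", "$350"),
]

# Attach a numeric price to each listing and find the cheapest one.
priced = [(title, parse_price(price)) for title, price in listings]
cheapest = min(priced, key=lambda item: item[1])
print(cheapest)  # ('Behringer Strat style guitar', 125)
```

Returning `None` for missing or unparseable prices keeps the decision about incomplete listings with the caller, much as the loop above skips rows that lack a title, price, or link.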

## Web Scraping with Pandas
`pd.read_html` parses the HTML tables it finds on a page into DataFrames.

In [34]:
import pandas as pd

In [35]:
url = 'https://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'
tables = pd.read_html(url)
len(tables)

8

It returns a list of DataFrames, one for each table that Pandas found on the page.

In [36]:
# Preview the first rows of every table to find the one of interest
for i, df in enumerate(tables):
    print(f"Table [{i}]#################################")
    print(df.head())

Table [0]#################################
                         City                    Building  \
              Albany Congress             Albany Congress   
0            Albany, New York                  Stadt Huys   
1          Stamp Act Congress          Stamp Act Congress   
2          New York, New York                   City Hall   
3  First Continental Congress  First Continental Congress   
4  Philadelphia, Pennsylvania            Carpenters' Hall   

                   Start Date                    End Date  \
              Albany Congress             Albany Congress   
0               June 19, 1754               July 11, 1754   
1          Stamp Act Congress          Stamp Act Congress   
2             October 7, 1765            October 25, 1765   
3  First Continental Congress  First Continental Congress   
4           September 5, 1774            October 26, 1774   

                     Duration                         Ref  
              Albany Congress            

In [37]:
# Copy the slice so the in-place set_index below does not act on a view of tables[1]
df = tables[1].iloc[2:].copy()
#df.columns = ['State', 'Abr.', 'State-hood Rank', 'Capital', 'Capital Since', 'Area (sq-mi)', 'Municipal Population', 'Metropolitan', 'Metropolitan Population', 'Population Rank', 'Notes']
df.set_index('State', inplace=True)
df.head()

| State | Capital | Capital Since | Area (mi2) | Population (2019 est.) | MSA/µSA Population (2019 est.) | CSA Population (2019 est.) | Rank in State (city proper) |
|---|---|---|---|---|---|---|---|
| Arizona | Phoenix | 1912 | 517.6 | 1680992 | 4948203 | 5002221 | 1 |
| Arkansas | Little Rock | 1821 | 116.2 | 197312 | 742384 | 908941 | 1 |
| California | Sacramento | 1854 | 97.9 | 513624 | 2363730 | 2639124 | 6 |
| Colorado | Denver | 1867 | 153.3 | 727211 | 2967239 | 3617927 | 1 |
| Connecticut | Hartford | 1875 | 17.3 | 122105 | 1204877 | 1470083 | 3 |


In [38]:
df.loc['Illinois']

Capital                           Springfield
Capital Since                            1837
Area (mi2)                               54.0
Population (2019 est.)                 114230
MSA/µSA Population (2019 est.)         206868
CSA Population (2019 est.)             306399
Rank in State (city proper)                 6
Name: Illinois, dtype: object