# Web Scraping

![Data Science Workflow](img/ds-workflow.png)

## Acquire Data
### Common Data Sources
- **The Internet - Web Scraping**
- Databases
- CSV
- Excel
- Parquet

### Web Scraping
- Extracting data from websites
- Legal issues: [wikipedia.org](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues)
- The legality of web scraping varies across the world.
- In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear.

### Be ethical
- Do not use scraped data for commercial purposes
- Keep scraped data for private use only

## Example
- Let's consider [https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics](https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics)
- **pandas** ```read_html()``` reads the HTML tables on a page into a list of DataFrame objects ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)).
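To see what ```read_html``` returns without depending on a live website, here is a minimal sketch that parses a small HTML string. The snippet in `html` is a hypothetical stand-in for a downloaded page, and the example assumes an HTML parser that pandas supports (such as `lxml`) is installed.

```python
import io

import pandas as pd

# Hypothetical HTML snippet standing in for a downloaded page.
html = """
<table>
  <thead>
    <tr><th>Year</th><th>Revenue</th></tr>
  </thead>
  <tbody>
    <tr><td>2020/2021</td><td>$ 162,886,686</td></tr>
    <tr><td>2019/2020</td><td>$ 129,234,327</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(html))
first = tables[0]

print(len(tables))
print(first)
```

Because the result is always a list, even a page with a single table has to be indexed with `tables[0]`.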

In [1]:
import pandas as pd

In [5]:
url = 'https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics'

data = pd.read_html(url)

In [7]:
type(data)

list

In [8]:
type(data[0])

pandas.core.frame.DataFrame

In [10]:
data[0].head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets |
|---|---|---|---|---|---|---|
| 0 | 2020/2021 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 |
| 1 | 2019/2020 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 |
| 2 | 2018/2019 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 |
| 3 | 2017/2018 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 |
| 4 | 2016/2017 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 |


In [11]:
data[1].head()

| | 0 | 1 |
|---|---|---|
| 0 | NaN | Wikimedia Commons has media related to Wikimed... |


In [12]:
fundraising = data[0]

In [15]:
fundraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
dtype: object

In [20]:
# Drop the leading '$ ' and then the thousands separators
fundraising['Exp'] = fundraising['Expenses'].str[2:]
fundraising['Exp'] = fundraising['Exp'].str.replace(',', '')
fundraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets | Exp |
|---|---|---|---|---|---|---|---|
| 0 | 2020/2021 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 | 111839819 |
| 1 | 2019/2020 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 | 112489397 |
| 2 | 2018/2019 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 | 91414010 |
| 3 | 2017/2018 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 | 81442265 |
| 4 | 2016/2017 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 | 69136758 |


In [21]:
fundraising['Exp'] = pd.to_numeric(fundraising['Exp'])

In [22]:
fundraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
Exp              int64
dtype: object
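The slicing and replacing steps above can also be wrapped in one reusable function. The following is a sketch, not the notebook's own method: `money_to_number` is a hypothetical helper name, and it swaps the fixed ```.str[2:]``` slice for a regular expression so that it does not depend on the prefix being exactly two characters long.

```python
import pandas as pd


def money_to_number(column: pd.Series) -> pd.Series:
    """Convert strings such as '$ 111,839,819' to numbers (hypothetical helper)."""
    # Remove every character that is not a digit, decimal point, or minus sign.
    cleaned = column.str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(cleaned)


# Sample values copied from the Expenses column shown above.
sample = pd.DataFrame(
    {"Expenses": ["$ 111,839,819", "$ 112,489,397", "$ 91,414,010"]}
)
sample["Exp"] = money_to_number(sample["Expenses"])

print(sample)
print(sample.dtypes)
```

The same function could then be applied to the other currency columns, such as `Revenue` or `Total assets`.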

## Data Wrangling
- Data wrangling (also called data munging) is the process of transforming and mapping data from a "raw" form into another format
- The goal is to make the data more appropriate and valuable for downstream purposes such as analytics

### Check the data types
- Remember ```.dtypes```: it shows the data type of each column
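As a small illustration of checking types, the sketch below builds a hypothetical three-column frame (values copied from the table above) and uses ```select_dtypes``` to separate the columns that are already numeric from those that still need cleaning. The variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical frame mixing raw text columns with one cleaned numeric column.
frame = pd.DataFrame(
    {
        "Year": ["2020/2021", "2019/2020"],
        "Expenses": ["$ 111,839,819", "$ 112,489,397"],
        "Exp": [111839819, 112489397],
    }
)

# One dtype per column; text columns are not usable for arithmetic yet.
print(frame.dtypes)

# Columns that are already numeric, and the remaining ones that are not.
numeric_columns = frame.select_dtypes(include="number").columns.tolist()
other_columns = [name for name in frame.columns if name not in numeric_columns]

print(numeric_columns)
print(other_columns)
```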

In [27]:
# Same cleaning as for Expenses: drop the leading '$ ' and the thousands separators
fundraising['Rev'] = fundraising['Revenue'].str[2:]
fundraising['Rev'] = fundraising['Rev'].str.replace(',', '')
fundraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets | Exp | Rev |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020/2021 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 | 111839819 | 162886686 |
| 1 | 2019/2020 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 | 112489397 | 129234327 |
| 2 | 2018/2019 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 | 91414010 | 120067266 |
| 3 | 2017/2018 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 | 81442265 | 104505783 |
| 4 | 2016/2017 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 | 69136758 | 91242418 |


In [28]:
fundraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
Exp              int64
Rev             object
dtype: object

In [29]:
# Deliberately insert an invalid value to see how to_numeric handles it
fundraising.loc[0, 'Rev'] = 'spam'

In [32]:
# errors='coerce' turns values that cannot be parsed into NaN instead of raising
fundraising['Rev'] = pd.to_numeric(fundraising['Rev'], errors='coerce')

In [33]:
fundraising.dtypes

Year             object
Source           object
Revenue          object
Expenses         object
Asset rise       object
Total assets     object
Exp               int64
Rev             float64
dtype: object

In [34]:
fundraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets | Exp | Rev |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020/2021 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 | 111839819 | NaN |
| 1 | 2019/2020 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 | 112489397 | 129234327.0 |
| 2 | 2018/2019 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 | 91414010 | 120067266.0 |
| 3 | 2017/2018 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 | 81442265 | 104505783.0 |
| 4 | 2016/2017 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 | 69136758 | 91242418.0 |
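Coercing bad values to NaN is convenient, but it also hides which entries failed. A minimal sketch of how the failed rows could be located afterwards follows, using a hypothetical three-value series in place of the full `Rev` column. It also shows that the default behaviour (```errors='raise'```) stops with a `ValueError` instead.

```python
import pandas as pd

# Hypothetical stand-in for the Rev column, including the injected bad value.
raw = pd.Series(["spam", "129234327", "120067266"], name="Rev")

# errors='coerce' replaces unparseable entries with NaN.
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but did not survive conversion.
failed = raw[converted.isna() & raw.notna()]
print(failed)

# With the default errors='raise', the same input raises a ValueError.
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Inspecting `failed` before dropping or filling the NaN values helps to tell genuinely missing data apart from parsing problems.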
