# Web Scraping

![Data Science Workflow](img/ds-workflow.png)

## Acquire Data
### Common Data Sources
- **The Internet - Web Scraping**
- Databases
- CSV
- Excel
- Parquet

### Web Scraping
- Extracting data from websites
- Legal issues: [wikipedia.org](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues)
- The legality of web scraping varies across the world.
- In general, web scraping may be against the terms of use of some websites, but the enforceability of these terms is unclear.

### Be ethical
- Not for commercial use
- Only private use

## Example
- Let's consider [https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics](https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics)
- **pandas** ```pd.read_html(url)``` reads every HTML table on a page into a list of DataFrame objects ([docs](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html)).
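To see what `read_html` returns without touching the network, here is a minimal sketch that parses a small inline table. The HTML snippet is made up for illustration (its two rows reuse figures from the Wikipedia table), and it assumes an HTML parser such as `lxml` is installed, which `read_html` needs in any case.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for a downloaded page: one table with a header row
html = """
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2020/21</td><td>$ 162,886,686</td></tr>
  <tr><td>2019/20</td><td>$ 129,234,327</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds
tables = pd.read_html(StringIO(html))
first = tables[0]

print(len(tables))
print(first)
```

Because the result is always a list, even a page with a single table has to be indexed with `[0]` to get the DataFrame.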

In [1]:
import pandas as pd

In [2]:
url='https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics'
data=pd.read_html(url)

In [3]:
type(data)

list

In [4]:
type(data[0])

pandas.core.frame.DataFrame

In [5]:
len(data)

1

In [6]:
data[0].head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets |
|---|---|---|---|---|---|---|
| 0 | 2020/21 | PDF | $ 162,886,686 | $ 111,839,819 | $ 50,861,811 | $ 231,177,536 |
| 1 | 2019/20 | PDF | $ 129,234,327 | $ 112,489,397 | $ 14,674,300 | $ 180,315,725 |
| 2 | 2018/19 | PDF | $ 120,067,266 | $ 91,414,010 | $ 30,691,855 | $ 165,641,425 |
| 3 | 2017/18 | PDF | $ 104,505,783 | $ 81,442,265 | $ 21,619,373 | $ 134,949,570 |
| 4 | 2016/17 | PDF | $ 91,242,418 | $ 69,136,758 | $ 21,547,402 | $ 113,330,197 |


In [7]:
funraising = data[0]

In [8]:
funraising.dtypes

Year            object
Source          object
Revenue         object
Expenses        object
Asset rise      object
Total assets    object
dtype: object

In [9]:
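# Drop the leading "$ " (two characters), remove the thousands separators, then convert to float
funraising['Expenses'] = funraising['Expenses'].str[2:]
funraising['Expenses'] = funraising['Expenses'].str.replace(',', '')
funraising['Expenses'] = funraising['Expenses'].astype(float)
# Alternative for the last step: funraising['Expenses'] = pd.to_numeric(funraising['Expenses'])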

In [10]:
funraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets |
|---|---|---|---|---|---|---|
| 0 | 2020/21 | PDF | $ 162,886,686 | 111839819.0 | $ 50,861,811 | $ 231,177,536 |
| 1 | 2019/20 | PDF | $ 129,234,327 | 112489397.0 | $ 14,674,300 | $ 180,315,725 |
| 2 | 2018/19 | PDF | $ 120,067,266 | 91414010.0 | $ 30,691,855 | $ 165,641,425 |
| 3 | 2017/18 | PDF | $ 104,505,783 | 81442265.0 | $ 21,619,373 | $ 134,949,570 |
| 4 | 2016/17 | PDF | $ 91,242,418 | 69136758.0 | $ 21,547,402 | $ 113,330,197 |
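Slicing with `.str[2:]` works only while every value starts with exactly `"$ "`. As a hedged alternative, and not the notebook's own approach, the sketch below strips dollar signs, commas, and whitespace with a regular expression and then uses `pd.to_numeric`, the function mentioned in the commented-out line. The sample values are copied from the Expenses column.

```python
import pandas as pd

# Sample values copied from the Expenses column above
expenses = pd.Series(["$ 111,839,819", "$ 112,489,397", "$ 91,414,010"])

# Remove dollar signs, commas and whitespace wherever they occur in the string
cleaned = expenses.str.replace(r"[$,\s]", "", regex=True)

# errors="coerce" turns anything that still is not a number into NaN
# instead of raising an exception
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric.tolist())
```

`pd.to_numeric` chooses the numeric dtype itself; chain `.astype(float)` if floats are wanted, as in the notebook.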


In [11]:
funraising['Revenue']=funraising['Revenue'].str[2:]
funraising['Revenue']=funraising['Revenue'].str.replace(',','')
funraising['Revenue']=funraising['Revenue'].astype(float)

In [12]:
funraising['Asset rise']=funraising['Asset rise'].str[2:]
funraising['Asset rise']=funraising['Asset rise'].str.replace(',','')
funraising['Asset rise']=funraising['Asset rise'].astype(float)

In [13]:
funraising['Total assets']=funraising['Total assets'].str[2:]
funraising['Total assets']=funraising['Total assets'].str.replace(',','')
funraising['Total assets']=funraising['Total assets'].astype(float)

In [14]:
funraising.dtypes

Year             object
Source           object
Revenue         float64
Expenses        float64
Asset rise      float64
Total assets    float64
dtype: object
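Cells 9 and 11 to 13 repeat the same three steps for each currency column. A minimal sketch of the same cleaning written once and applied in a loop follows. The helper name `currency_to_float` and the two-row sample frame are illustrative assumptions; the rows are copied from the table above.

```python
import pandas as pd

# Two rows copied from the scraped table, still in their raw string form
df = pd.DataFrame({
    "Year": ["2020/21", "2019/20"],
    "Revenue": ["$ 162,886,686", "$ 129,234,327"],
    "Expenses": ["$ 111,839,819", "$ 112,489,397"],
    "Asset rise": ["$ 50,861,811", "$ 14,674,300"],
    "Total assets": ["$ 231,177,536", "$ 180,315,725"],
})


def currency_to_float(column):
    """Drop the leading "$ ", remove thousands separators, convert to float."""
    return column.str[2:].str.replace(",", "", regex=False).astype(float)


# Hypothetical helper applied to every currency column in turn
money_columns = ["Revenue", "Expenses", "Asset rise", "Total assets"]
for name in money_columns:
    df[name] = currency_to_float(df[name])

print(df.dtypes)
```

Keeping the steps in one function means a change to the cleaning rule, for example a different currency prefix, has to be made in only one place.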

## Data Wrangling
- Data wrangling (also called data munging) is the process of transforming and mapping data from a "raw" form into another format, with the intent of making it more appropriate and valuable for downstream purposes such as analytics

### Check the data types
- Remember ```.dtypes```

In [15]:
funraising.dtypes

Year             object
Source           object
Revenue         float64
Expenses        float64
Asset rise      float64
Total assets    float64
dtype: object

In [16]:
funraising.head()

| | Year | Source | Revenue | Expenses | Asset rise | Total assets |
|---|---|---|---|---|---|---|
| 0 | 2020/21 | PDF | 162886686.0 | 111839819.0 | 50861811.0 | 231177536.0 |
| 1 | 2019/20 | PDF | 129234327.0 | 112489397.0 | 14674300.0 | 180315725.0 |
| 2 | 2018/19 | PDF | 120067266.0 | 91414010.0 | 30691855.0 | 165641425.0 |
| 3 | 2017/18 | PDF | 104505783.0 | 81442265.0 | 21619373.0 | 134949570.0 |
| 4 | 2016/17 | PDF | 91242418.0 | 69136758.0 | 21547402.0 | 113330197.0 |
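The final `dtypes` output lists `Year` as `datetime64[ns]`, but the cell that performed the conversion is not shown. One way it could be done, offered as an assumption rather than the notebook's actual code, is to keep the first four characters of each label (the starting year of the fiscal period) and parse them with `pd.to_datetime`.

```python
import pandas as pd

# Fiscal-year labels copied from the Year column
years = pd.Series(["2020/21", "2019/20", "2018/19"])

# Assumption: represent each fiscal year by its starting calendar year
start_year = years.str[:4]

# format="%Y" parses a four-digit year; month and day default to January 1
as_dates = pd.to_datetime(start_year, format="%Y")

print(as_dates.dt.year.tolist())  # [2020, 2019, 2018]
```

Choosing the starting year is a modelling decision; using the ending year instead would shift every date by one year.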


In [18]:
funraising.dtypes

Year            datetime64[ns]
Source                  object
Revenue                float64
Expenses               float64
Asset rise             float64
Total assets           float64
dtype: object