
Scrape the last table from this Wikipedia page: [Red states and blue states](https://en.wikipedia.org/wiki/Red_states_and_blue_states)

For this task, I will use the **Beautiful Soup** and **Requests** libraries. Let's dive in.

## Step 1: Importing the libraries

In [1]:
import requests
import bs4
import pandas as pd

## Step 2: Identifying the URL

In [2]:
url = 'https://en.wikipedia.org/wiki/Red_states_and_blue_states'

## Step 3: Sending the request and parsing the response

In [3]:
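# Download the page, then parse its HTML with Python's built-in parser.
# The parser name belongs to BeautifulSoup, not to requests.get.
webpage = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')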
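
The one-liner above can also be split into a download step and a parse step, which makes failures easier to spot. The following is only a sketch: the function names `fetch_html` and `parse_html`, the `User-Agent` string, and the 10-second timeout are assumptions made for illustration, not part of the original notebook.

```python
import bs4
import requests

# Hypothetical User-Agent string; replace it with one that identifies your script.
HEADERS = {"User-Agent": "static-scraping-tutorial/0.1 (educational example)"}


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text


def parse_html(html):
    """Parse an HTML string with Python's built-in parser."""
    return bs4.BeautifulSoup(html, "html.parser")


# Offline check of the parsing half, using a tiny inline document.
sample_soup = parse_html("<html><head><title>Demo</title></head><body></body></html>")
print(sample_soup.title.text)  # Demo
```

With these helpers, `parse_html(fetch_html(url))` would play the same role as `webpage` in this step.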

## Step 4: Loading the webpage

In [4]:
webpage

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Red states and blue states - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vect

## Step 5: Finding all the tables and selecting the desired one

In [5]:
tables = webpage.find_all(name='table', attrs={'class':'wikitable'})
tables[-1]

<table class="wikitable sortable mw-collapsible" style="text-align:center">
<tbody><tr>
<th>Year
</th>
<th style="text-align:center; width:7.69%;"><a href="/wiki/1972_United_States_presidential_election" title="1972 United States presidential election">1972</a>
</th>
<th style="text-align:center; width:7.69%;"><a href="/wiki/1976_United_States_presidential_election" title="1976 United States presidential election">1976</a>
</th>
<th style="text-align:center; width:7.69%;"><a href="/wiki/1980_United_States_presidential_election" title="1980 United States presidential election">1980</a>
</th>
<th style="text-align:center; width:7.69%;"><a href="/wiki/1984_United_States_presidential_election" title="1984 United States presidential election">1984</a>
</th>
<th style="text-align:center; width:7.69%;"><a href="/wiki/1988_United_States_presidential_election" title="1988 United States presidential election">1988</a>
</th>
<th style="text-align:center; width:7.69%;"><a href="/wiki/1992_United_S

In [6]:
# Counting the rows of the last table (to compare with the table on the page)
table_rows = tables[-1].find_all(name='tr')
len(table_rows)

59
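
To see why filtering on `'class': 'wikitable'` also picks up a table whose class is `wikitable sortable mw-collapsible`, here is a small offline sketch. The HTML is made up for the demonstration and only stands in for the real page.

```python
import bs4

# Made-up HTML standing in for the Wikipedia page.
html = """
<table class="wikitable sortable"><tr><th>A</th></tr><tr><td>1</td></tr></table>
<table class="infobox"><tr><td>skip me</td></tr></table>
<table class="wikitable"><tr><th>B</th></tr><tr><td>2</td></tr><tr><td>3</td></tr></table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# 'class' is a multi-valued attribute, so 'wikitable' matches any table
# that lists it among its classes.
demo_tables = soup.find_all(name="table", attrs={"class": "wikitable"})
last_table = demo_tables[-1]

print(len(demo_tables))                # 2
print(len(last_table.find_all("tr")))  # 3
```

Indexing with `[-1]` picks the last match, which is why the order of tables on the page matters for this approach.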

## Step 6: Extracting the column names from the first row

In [7]:
col_names = [raw_name.text.replace('\n', '') for raw_name in table_rows[0].find_all(name='th')]
col_names # Checking the first row

['Year',
 '1972',
 '1976',
 '1980',
 '1984',
 '1988',
 '1992',
 '1996',
 '2000',
 '2004',
 '2008',
 '2012',
 '2016',
 '2020']
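
As an alternative to `.text.replace('\n', '')`, Beautiful Soup's `get_text(strip=True)` removes the surrounding whitespace in one call. The sketch below uses a shortened, hand-written copy of the header row rather than the live page.

```python
import bs4

# Shortened, hand-written copy of the header row.
header_html = """
<table><tr>
<th>Year
</th>
<th><a href="/wiki/1972_United_States_presidential_election">1972</a>
</th>
</tr></table>
"""
header_row = bs4.BeautifulSoup(header_html, "html.parser").find("tr")

# strip=True trims each text fragment and drops the empty ones before joining.
demo_names = [cell.get_text(strip=True) for cell in header_row.find_all("th")]
print(demo_names)  # ['Year', '1972']
```

One caveat: when a cell contains several text fragments, `strip=True` joins them without a space, so passing a separator such as `get_text(" ", strip=True)` may be preferable for cells with mixed text and links.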

## Step 7: Selecting the table (body)

In [8]:
contents = [raw_name.text.replace('\n', '').strip() for raw_name in table_rows[2].find_all(name='td')]
contents  # Checking the table (body)

['Democratic candidate',
 'George McGovern',
 'Jimmy Carter',
 'Jimmy Carter',
 'Walter Mondale',
 'Michael Dukakis',
 'Bill Clinton',
 'Bill Clinton',
 'Al Gore',
 'John Kerry',
 'Barack Obama',
 'Barack Obama',
 'Hillary Clinton',
 'Joe Biden']

## Step 8: Collecting the table body rows into a list

In [9]:
contents = []
# Start at row 2, the first body row shown in Step 7
for row_id in range(2, len(table_rows)):
    # Map each cell to the column name at the same position
    content = {col_names[idx] : raw_name.text.replace('\n', '').strip() for idx, raw_name in enumerate(table_rows[row_id].find_all(name='td'))}
    contents.append(content)



```
# Same result as above (just in one go)
contents = [{col_names[idx] : raw_name.text.replace('\n', '').strip() for idx, raw_name in enumerate(table_rows[row_id].find_all(name='td'))} for row_id in range(2, len(table_rows))]
```
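
The same row-to-dictionary mapping can be written with `zip`, which pairs column names with cells directly. This is a sketch on made-up HTML, and the names prefixed with `demo_` are placeholders. Note one behavioral difference: `zip` stops at the shorter sequence, so a row with extra cells is silently truncated, whereas indexing `col_names[idx]` would raise an `IndexError`.

```python
import bs4

# Made-up table; the second body row is deliberately one cell short.
row_html = """
<table>
<tr><th>Year</th><th>1972</th><th>1976</th></tr>
<tr><td>Alabama</td><td>Nixon</td><td>Carter</td></tr>
<tr><td>Alaska</td><td>Nixon</td></tr>
</table>
"""
demo_rows = bs4.BeautifulSoup(row_html, "html.parser").find_all("tr")
demo_cols = [th.get_text(strip=True) for th in demo_rows[0].find_all("th")]

demo_contents = []
for row in demo_rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    # zip pairs names with cells and stops at the shorter of the two
    demo_contents.append(dict(zip(demo_cols, cells)))

print(demo_contents[0])  # {'Year': 'Alabama', '1972': 'Nixon', '1976': 'Carter'}
print(demo_contents[1])  # {'Year': 'Alaska', '1972': 'Nixon'}
```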



## Step 9: Creating a DataFrame and showing the first 5 rows

In [10]:
df = pd.DataFrame(data=contents, columns=col_names)
df.head()

|   | Year | 1972 | 1976 | 1980 | 1984 | 1988 | 1992 | 1996 | 2000 | 2004 | 2008 | 2012 | 2016 | 2020 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Democratic candidate | George McGovern | Jimmy Carter | Jimmy Carter | Walter Mondale | Michael Dukakis | Bill Clinton | Bill Clinton | Al Gore | John Kerry | Barack Obama | Barack Obama | Hillary Clinton | Joe Biden |
| 1 | Republican candidate | Richard Nixon | Gerald Ford | Ronald Reagan | Ronald Reagan | George H. W. Bush | George H. W. Bush | Bob Dole | George W. Bush | George W. Bush | John McCain | Mitt Romney | Donald Trump | Donald Trump |
| 2 | National popular vote | Nixon | Carter | Reagan | Reagan | Bush | Clinton | Clinton | Gore | Bush | Obama | Obama | Clinton | Biden |
| 3 | Alabama | Nixon | Carter | Reagan | Reagan | Bush | Bush | Dole | Bush | Bush | McCain | Romney | Trump | Trump |
| 4 | Alaska | Nixon | Ford | Reagan | Reagan | Bush | Bush | Dole | Bush | Bush | McCain | Romney | Trump | Trump |
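
Building a DataFrame from a list of dictionaries has a useful side effect: a key that is missing from one dictionary becomes `NaN` in that row instead of raising an error. The sketch below shows this on two hand-written records; the `demo_` names and the deliberately missing value are assumptions made for the demonstration.

```python
import pandas as pd

# Two hand-written records; '1976' is left out of the second one on purpose.
demo_records = [
    {"Year": "Alabama", "1972": "Nixon", "1976": "Carter"},
    {"Year": "Alaska", "1972": "Nixon"},
]
demo_df = pd.DataFrame(data=demo_records, columns=["Year", "1972", "1976"])

# Using the label column as the index allows lookups by row name.
demo_df = demo_df.set_index("Year")

print(demo_df.shape)                           # (2, 2)
print(pd.isna(demo_df.loc["Alaska", "1976"]))  # True
```

Passing `columns=` fixes the column order, which is why the real DataFrame follows `col_names` regardless of the key order inside each dictionary.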


## Step 10: Saving the dataset

In [11]:
df.to_csv('red_blue_states.csv', index=False)
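
A quick way to confirm the export is to read the file back and compare it with the original. The sketch below does this with a small stand-in DataFrame and a temporary directory, so it does not touch `red_blue_states.csv`; the file name `demo_states.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the scraped DataFrame.
demo_df = pd.DataFrame({"Year": ["Alabama", "Alaska"], "1972": ["Nixon", "Nixon"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo_states.csv")  # hypothetical file name
    demo_df.to_csv(csv_path, index=False)
    # dtype=str keeps every value as text instead of letting pandas guess types
    restored = pd.read_csv(csv_path, dtype=str)

# Compare column names and cell values after the round trip.
same_columns = list(restored.columns) == list(demo_df.columns)
same_values = restored.values.tolist() == demo_df.values.tolist()
print(same_columns, same_values)  # True True
```

Because `index=False` was used when writing, the file contains no index column, so the restored DataFrame has no `Unnamed: 0` column.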