
### Country Data Web Scraping Activity

In this notebook, I practice my newly gained web scraping skills. The goal is to collect data on the countries of the world from the [Scrape This Site sandbox](https://www.scrapethissite.com/pages/simple/).

At the end, I store the scraped data in a pandas DataFrame and then export it to CSV, TSV, and Excel formats.

#### Tools Used


* Python
* pandas library: to structure the data
* BeautifulSoup library: to parse HTML content
* Requests library: to handle HTTP requests



In [None]:
# Import the necessary libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

In [None]:
# Store the URL of the site to scrape in a variable and display it to confirm it is correct
url = 'https://www.scrapethissite.com/pages/simple/'
url

'https://www.scrapethissite.com/pages/simple/'

In [None]:
# Send an HTTP GET request to the site and show its status code
response = requests.get(url)
response.status_code

200
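
Checking the status code by eye works in a notebook, but a script usually needs to stop when the request fails. Below is a minimal sketch, assuming a hypothetical helper named `fetch_html` (not part of this notebook) and an arbitrary 10-second timeout. The `timeout` argument and `raise_for_status` are standard Requests features.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: return the page's HTML or raise on failure."""
    # The timeout stops the call from waiting forever on a stalled connection
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the `url` defined earlier would return the page markup as a string, ready to pass to BeautifulSoup.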

In [None]:
# Parsing the HTML content of the page
soup = BeautifulSoup(response.content, 'html.parser')

## Scrape Data

#### Checklist
The following data will be obtained and stored in lists, one list per field:

* Country name
* Country capital
* Country population
* Area of the country (km²)

#### Method Used
* Find all elements of a particular category (e.g., country name) using their HTML tag and class name.
* Create a list to store the data.
* Using a for loop, extract the text from each element and append it to the list.
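
The three steps above can be sketched on a small piece of markup before running them on the real page. The HTML below is made up for illustration and only mimics the tag and class names used in this notebook. The names `sample_html`, `sample_soup`, `elements`, and `names` are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mimics the structure targeted in this notebook
sample_html = """
<div class="country">
  <h3 class="country-name"> Andorra </h3>
</div>
<div class="country">
  <h3 class="country-name"> Angola </h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Step 1: find every element with the given tag and class
elements = sample_soup.find_all("h3", class_="country-name")

# Step 2: create a list to hold the cleaned values
names = []

# Step 3: extract the text of each element and append it to the list
for element in elements:
    names.append(element.get_text(strip=True))

print(names)  # ['Andorra', 'Angola']
```

Passing `strip=True` to `get_text` removes the whitespace around the text, so the list holds clean values.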




#### 1. Country Name


In [None]:
# Find every <h3> element with the class 'country-name'
country = soup.find_all('h3', class_='country-name')

country_name = []

# Iterate over the matched elements, extract the text, and store it in country_name
for name in country:
  name_ = name.get_text(strip=True)
  country_name.append(name_)



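#### 2. Country Capital

In [None]:
# Find every <span> element with the class 'country-capital'
country_capital = soup.find_all('span', class_='country-capital')

capital = []

# Extract the text of each element and store it in capital
for cap in country_capital:
  cap_ = cap.get_text(strip=True)
  capital.append(cap_)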


#### 3. Country Population



In [None]:
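# Find every <span> element with the class 'country-population'
country_population = soup.find_all('span', class_='country-population')

population = []

# Extract the text of each element and store it in population
for pop in country_population:
  pop_ = pop.get_text(strip=True)
  population.append(pop_)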


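#### 4. Country Area

In [None]:
# Find every <span> element with the class 'country-area'
country_area = soup.find_all('span', class_='country-area')

area = []

# Extract the text of each element and store it in area
for aa in country_area:
  aa_ = aa.get_text(strip=True)
  area.append(aa_)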



### Create a DataFrame from the scraped lists

In [None]:
# Combine the four lists into one DataFrame, one column per list
data = pd.DataFrame({
  'Country': country_name,
  'Capital': capital,
  'Population': population,
  'Area (km\u00b2)': area,
})
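
The scraped values are all strings, so the population and area columns need converting before any arithmetic or sorting by size. Below is a minimal sketch that uses made-up rows rather than the scraped data. The frame name `sample` and its values are illustrative only. `pd.to_numeric` with `errors="coerce"` turns entries it cannot parse into `NaN`.

```python
import pandas as pd

# Made-up rows standing in for the scraped lists (values are illustrative only)
sample = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population": ["1200", "350000", "n/a"],
    "Area (km\u00b2)": ["468.0", "82880.0", "14.0"],
})

# Convert the text columns to numbers; entries that cannot be parsed become NaN
for column in ["Population", "Area (km\u00b2)"]:
    sample[column] = pd.to_numeric(sample[column], errors="coerce")

print(sample)
```

The same loop could be applied to `data` once the four lists are in place.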



In [None]:
# Save the DataFrame as a CSV file without the index column
data.to_csv('Countries WebScrape.csv', index=False)

In [None]:
# Same logic, refactored into a single loop

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
# response.status_code

soup = BeautifulSoup(response.text, 'html.parser')

# Scraping data
Countries = []
country = soup.find_all('div', class_='col-md-4 country')

# Loop over each country block, extract its four fields, and append them to 'Countries'
for count in country:
  name = count.find('h3', class_='country-name').get_text(strip=True)
  capital = count.find('span', class_='country-capital').get_text(strip=True)
  population = count.find('span', class_='country-population').get_text(strip=True)
  area = count.find('span', class_='country-area').get_text(strip=True)

  Countries.append({
      'Name': name,
      'Capital': capital,
      'Population': population,
      'Area (km\u00b2)': area
  })

# Conversion to DataFrame
Countries_Data = pd.DataFrame(Countries)
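
The loop above assumes every country block contains all four elements. `find` returns `None` when nothing matches, and calling `get_text` on `None` raises an `AttributeError`. Below is a minimal sketch of one way to guard against that, assuming a hypothetical helper named `text_or_none` (not part of this notebook) and a made-up block of markup.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, class_name):
    """Hypothetical helper: return a child's stripped text, or None if it is absent."""
    element = parent.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else None


# Made-up block with no capital, to exercise the missing-element path
block = BeautifulSoup(
    '<div class="country"><h3 class="country-name">Atlantis</h3></div>',
    "html.parser",
)

print(text_or_none(block, "h3", "country-name"))       # Atlantis
print(text_or_none(block, "span", "country-capital"))  # None
```

With a guard like this, a missing field ends up as an empty cell in the DataFrame instead of stopping the whole scrape.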


In [None]:
# Export to CSV, TSV, and Excel, respectively, without the index column
Countries_Data.to_csv('Scraped Countries Data.csv', index=False)
Countries_Data.to_csv('Scraped Countries Data.tsv', sep='\t', index=False)
Countries_Data.to_excel('Scraped Countries Data.xlsx', index=False)
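
One way to check an export is to read the file back and compare it with the original frame. Below is a minimal sketch that uses a made-up frame and a temporary folder, so it does not touch the files written above. The names `frame`, `csv_path`, and `tsv_path` are placeholders. The Excel file is left out because writing it needs an extra engine package such as openpyxl.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for the scraped data
frame = pd.DataFrame({"Name": ["A", "B"], "Capital": ["X", "Y"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "check.csv"
    tsv_path = Path(tmp_dir) / "check.tsv"

    # Write both files without the index column
    frame.to_csv(csv_path, index=False)
    frame.to_csv(tsv_path, sep="\t", index=False)

    # Read both files back; the TSV needs the same separator it was written with
    from_csv = pd.read_csv(csv_path)
    from_tsv = pd.read_csv(tsv_path, sep="\t")

# Compare the column names of the original and the two copies
print(list(frame.columns), list(from_csv.columns), list(from_tsv.columns))
```

If a file is written with `index=True`, the copy read back gains an extra unnamed column holding the row numbers, which is an easy way to spot the difference.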


#### Conclusion

The data in the checklist (country name, capital, population, and area) were successfully obtained from the site and stored in a pandas DataFrame, which was then exported to three file formats: CSV, TSV, and Excel.