# Countries of the World Analysis

#### In this notebook we will scrape and analyze the data on the website [Countries of the World](https://www.scrapethissite.com/pages/simple/)

#### The first step is to import all the necessary libraries
* requests
* pandas
* BeautifulSoup

In [37]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Now, let's fetch the website and make some soup

In [38]:
url = "https://www.scrapethissite.com/pages/simple/"
response = requests.get(url).text
soup = BeautifulSoup(response, "html.parser")
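#### A side note on failed requests

The fetch above assumes the request succeeds. As a minimal sketch, assuming we want the notebook to stop early on a network or HTTP error, the call can be wrapped in a small helper. The name `fetch_html` and the 10-second timeout are illustrative choices, not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the page body, raising if the request fails (hypothetical helper)."""
    # timeout keeps the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html("https://www.scrapethissite.com/pages/simple/")
```

The returned string can be passed to `BeautifulSoup(html, "html.parser")` exactly as before.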

#### From a quick look at the website, I know I will have the following columns:
- Country name
- Capital
- Population
- Area

#### So let's create a dictionary to hold all these values

In [39]:
country_info = {
    "Country":[],
    "Capital":[],
    "Population":[],
    "Area (km^2)":[]
}

#### With all the basics taken care of, let's scrape the website

In [40]:
# Each country's information is stored in an element with this class
items = soup.find_all(class_="col-md-4 country")

# Loop over the items, extract the country, capital, population, and area, and store them in the dictionary
for item in items:
    country = item.find(class_="country-name").get_text().strip()
    capital = item.find(class_="country-capital").get_text().strip()
    population = item.find(class_="country-population").get_text().strip()
    area = item.find(class_="country-area").get_text().strip()

    country_info["Country"].append(country)
    country_info["Capital"].append(capital)
    country_info["Population"].append(population)
    country_info["Area (km^2)"].append(area)

# Transform the dictionary into a dataframe
df = pd.DataFrame(country_info)
df

Unnamed: 0,Country,Capital,Population,Area (km^2)
0,Andorra,Andorra la Vella,84000,468.0
1,United Arab Emirates,Abu Dhabi,4975593,82880.0
2,Afghanistan,Kabul,29121286,647500.0
3,Antigua and Barbuda,St. John's,86754,443.0
4,Anguilla,The Valley,13254,102.0
...,...,...,...,...
245,Yemen,Sanaa,23495361,527970.0
246,Mayotte,Mamoudzou,159042,374.0
247,South Africa,Pretoria,49000000,1219912.0
248,Zambia,Lusaka,13460305,752614.0
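
#### A side note on the class lookup

Passing a multi-class string such as `"col-md-4 country"` to `class_` matches only that exact attribute value, so it would miss an element whose classes appear in a different order. The sketch below shows a more tolerant lookup with a CSS selector and builds one dictionary per country. The HTML snippet is a simplified, hypothetical stand-in for the real page, and the `fields` mapping is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real page markup
html = """
<div class="col-md-4 country">
  <h3 class="country-name"> Andorra </h3>
  <span class="country-capital">Andorra la Vella</span>
  <span class="country-population">84000</span>
  <span class="country-area">468.0</span>
</div>
<div class="country col-md-4">
  <h3 class="country-name"> Anguilla </h3>
  <span class="country-capital">The Valley</span>
  <span class="country-population">13254</span>
  <span class="country-area">102.0</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Column name -> CSS class that holds the value
fields = {
    "Country": "country-name",
    "Capital": "country-capital",
    "Population": "country-population",
    "Area (km^2)": "country-area",
}

# select("div.country") matches on the single class, whatever the class order
rows = [
    {column: item.find(class_=css_class).get_text(strip=True)
     for column, css_class in fields.items()}
    for item in soup.select("div.country")
]

# The exact-string lookup only finds the first div in this snippet
exact = soup.find_all(class_="col-md-4 country")
print(len(rows), len(exact))
```

A list of dictionaries like `rows` can be handed straight to `pd.DataFrame(rows)`.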


### Good! Our dataframe is looking the way we want.

#### Before we do any analysis, it is important to call the .info() method to check the column types and non-null counts of the data we will be working on.

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Country      250 non-null    object
 1   Capital      250 non-null    object
 2   Population   250 non-null    object
 3   Area (km^2)  250 non-null    object
dtypes: object(4)
memory usage: 7.9+ KB


#### Hmm, it looks like the "Population" and "Area (km^2)" columns are stored as strings (dtype object) and not as numerical values. Let's change that!

In [42]:
# Convert population from string to integer
df["Population"] = df["Population"].astype(int)

#### There is another way to do the same conversion.
#### The pd.to_numeric() function can also downcast the result to the smallest integer, signed, unsigned, or float dtype that fits the data, through its downcast argument.
#### Its errors argument controls what happens to values that cannot be parsed: errors="coerce" turns them into NaN instead of raising an exception. For more information: [to_numeric()](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html#pandas.to_numeric)

In [43]:
df["Area (km^2)"] = pd.to_numeric(df["Area (km^2)"], downcast="integer", errors="coerce")
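
#### A side note on errors="coerce"

Because `errors="coerce"` silently replaces bad values with NaN, it is worth checking how many entries were affected. This is a minimal sketch on made-up input: the `"n/a"` entry is hypothetical and does not come from the scraped data.

```python
import pandas as pd

# Hypothetical raw values: the last entry cannot be parsed as a number
raw = pd.Series(["468.0", "82880.0", "n/a"], dtype=object)

# errors="coerce" replaces unparseable entries with NaN instead of raising
parsed = pd.to_numeric(raw, errors="coerce")

# Count the entries that failed to parse and list the original strings
n_failed = int(parsed.isna().sum())
failed_values = raw[parsed.isna()].tolist()
print(n_failed, failed_values)
```

In this notebook the second `df.info()` call reports 250 non-null areas, so no value was coerced to NaN.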

#### Let's see what our dataframe looks like now.

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Country      250 non-null    object 
 1   Capital      250 non-null    object 
 2   Population   250 non-null    int64  
 3   Area (km^2)  250 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 7.9+ KB


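#### Note that even though we asked for an integer downcast, the column ended up as float64. The downcast is only applied when every value can be stored as an integer without loss, and some of the areas have a fractional part (the total area computed below ends in .69).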
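
#### A side note on downcast="integer"

The behaviour can be reproduced on a few values. In this sketch the first series uses whole-number areas taken from the table above, while the `"102.5"` entry in the second series is a made-up fractional value.

```python
import pandas as pd

# All values are whole numbers, so an integer downcast is lossless
whole = pd.Series(["468.0", "82880.0", "647500.0"], dtype=object)
# Hypothetical series with one fractional value
fractional = pd.Series(["468.0", "82880.0", "102.5"], dtype=object)

whole_num = pd.to_numeric(whole, downcast="integer")
fractional_num = pd.to_numeric(fractional, downcast="integer")

print(whole_num.dtype)       # an integer dtype
print(fractional_num.dtype)  # stays a float dtype
```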

#### With the values in the right data type, let's do some analysis now.

## Analysis

#### Country with the largest population

In [45]:
# idxmax() returns an index label, so look the row up by label with .loc
largest_pop = df["Population"].idxmax()
df.loc[largest_pop]


Country             China
Capital           Beijing
Population     1330044000
Area (km^2)     9596960.0
Name: 47, dtype: object
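
#### A side note on .loc and .iloc

`idxmax()` returns an index label, not a row position. With the default 0 to n-1 index the two coincide, but after sorting or filtering they can differ, and then only the label-based `.loc` returns the intended row. A minimal sketch on a small made-up frame (`df_demo` is an illustrative name):

```python
import pandas as pd

# Small frame whose index labels differ from the row positions
df_demo = pd.DataFrame(
    {"Country": ["Andorra", "China", "India"],
     "Population": [84000, 1330044000, 1173108018]},
    index=[2, 0, 1],
)

label = df_demo["Population"].idxmax()  # index label of the maximum, here 0

by_label = df_demo.loc[label]      # row labelled 0: China
by_position = df_demo.iloc[label]  # row at position 0: Andorra

print(by_label["Country"], by_position["Country"])
```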

#### Country with the largest land area

In [46]:
# idxmax() returns an index label, so look the row up by label with .loc
largest_area = df["Area (km^2)"].idxmax()
df.loc[largest_area]

Country            Russia
Capital            Moscow
Population      140702000
Area (km^2)    17100000.0
Name: 190, dtype: object

#### Top 10 largest countries by population

In [47]:
largest_country_pop = df.sort_values(by=["Population"], ascending=False)
largest_country_pop.head(10)

Unnamed: 0,Country,Capital,Population,Area (km^2)
47,China,Beijing,1330044000,9596960.0
104,India,New Delhi,1173108018,3287590.0
232,United States,Washington,310232863,9629091.0
100,Indonesia,Jakarta,242968342,1919440.0
30,Brazil,Brasília,201103330,8511965.0
177,Pakistan,Islamabad,184404791,803940.0
18,Bangladesh,Dhaka,156118464,144000.0
163,Nigeria,Abuja,154000000,923768.0
190,Russia,Moscow,140702000,17100000.0
113,Japan,Tokyo,127288000,377835.0
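
#### A side note on nlargest()

Sorting the whole frame and taking the head works well here. As an alternative sketch, `DataFrame.nlargest()` expresses "top n by a column" in one call. The small frame below reuses a few rows from this notebook, and `df_demo` is an illustrative name.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["Andorra", "China", "United States", "India"],
    "Population": [84000, 1330044000, 310232863, 1173108018],
})

# Two equivalent ways to get the two most populous rows
top2_sorted = df_demo.sort_values(by="Population", ascending=False).head(2)
top2_nlargest = df_demo.nlargest(2, "Population")

print(top2_nlargest["Country"].tolist())
```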


#### Top 10 largest countries by area

In [48]:
largest_country_area = df.sort_values(by=["Area (km^2)"], ascending=False)
largest_country_area.head(10)

Unnamed: 0,Country,Capital,Population,Area (km^2)
190,Russia,Moscow,140702000,17100000.0
8,Antarctica,,0,14000000.0
37,Canada,Ottawa,33679000,9984670.0
232,United States,Washington,310232863,9629091.0
47,China,Beijing,1330044000,9596960.0
30,Brazil,Brasília,201103330,8511965.0
12,Australia,Canberra,21515754,7686850.0
104,India,New Delhi,1173108018,3287590.0
9,Argentina,Buenos Aires,41343201,2766890.0
124,Kazakhstan,Astana,15340000,2717300.0


#### Note that Antarctica shows up here. The website we scraped includes it in its list even though Antarctica is a continent, not a country.
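
#### A side note on filtering such entries

If entries like Antarctica should be left out of a ranking, one option is to filter before sorting. The sketch below assumes that a population of 0 is a good enough marker for them, which is an assumption about the data and not something the website states. It uses three rows from the table above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Country": ["Russia", "Antarctica", "Canada"],
    "Population": [140702000, 0, 33679000],
    "Area (km^2)": [17100000.0, 14000000.0, 9984670.0],
})

# Assumption: a population of 0 marks an entry we do not want to rank
populated = df_demo[df_demo["Population"] > 0]
largest_populated = populated.sort_values(by="Area (km^2)", ascending=False)

print(largest_populated["Country"].tolist())
```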

#### Let's move on to the population. How many people live in all these countries combined?

In [49]:
total_pop = df["Population"].sum()
print(f"Total population: {total_pop:,} people")

Total population: 6,861,418,895 people


#### Perhaps we can do the same with the area? How much land do all these countries cover in total?

In [50]:
total_area = df["Area (km^2)"].sum()
print(f"Total area: {total_area:,} km^2")

Total area: 149,909,229.69 km^2


#### And last, let's look at another very useful method, .describe(), which displays summary statistics for the numeric columns of our data

In [51]:
df.describe()

Unnamed: 0,Population,Area (km^2)
count,250.0,250.0
mean,27445680.0,599636.9
std,116862600.0,1911821.0
min,0.0,0.0
25%,179856.2,1174.75
50%,4288138.0,64894.5
75%,15420620.0,372631.5
max,1330044000.0,17100000.0
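
#### A side note on reading the summary

In the table above the mean population (about 27.4 million) is far larger than the median (about 4.3 million), which indicates that a few very populous countries pull the mean upward. The sketch below shows how to read the same two figures out of a `describe()` result, using only the populations of the first five rows displayed earlier in this notebook.

```python
import pandas as pd

# Populations of the first five rows shown earlier in this notebook
population = pd.Series([84000, 4975593, 29121286, 86754, 13254], name="Population")

# describe() on a Series returns a Series indexed by statistic name
summary = population.describe()
mean, median = summary["mean"], summary["50%"]

print(f"mean={mean:,.1f} median={median:,.0f} ratio={mean / median:.1f}")
```

A mean well above the median is a quick sign of a right-skewed column, so the median is often the more representative "typical" value for data like this.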
