<div style="text-align: center;">
  <h2>Milestone 3: Formatting Website Data Source</h2>
</div>

## Data Source and Handling

The dataset used in this project was sourced from Wikipedia: [https://en.wikipedia.org/wiki/List_of_countries_by_arable_land_density](https://en.wikipedia.org/wiki/List_of_countries_by_arable_land_density).

## Steps

### 1 - Import necessary Libraries

In [17]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

# Ignore all Warnings
import warnings
warnings.filterwarnings('ignore')

### 2 - Reading Web Data

Retrieved data from Wikipedia using the requests library.

In [18]:
# Make a GET request to the Wikipedia page using requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_arable_land_density')
print('Requests status Code:', page.status_code)


Requests status Code: 200
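The request above sets no timeout and only prints the status code. A minimal sketch of a slightly more defensive fetch is shown below. The function name `fetch_html`, the injectable `get` parameter, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising on HTTP error responses."""
    # Hypothetical User-Agent string; replace it with one that identifies your project.
    headers = {"User-Agent": "arable-land-notebook/0.1 (educational use)"}
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://en.wikipedia.org/wiki/List_of_countries_by_arable_land_density')` would then return the HTML that the next step parses. The `get` parameter exists only so the function can be exercised offline with a stub.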


### 3 - Identify the Tabular component by count

Parsed the HTML content with BeautifulSoup to count and analyze the number of tabular elements present.

In [19]:
# Parse the response body with the built-in HTML parser

soup = BeautifulSoup(page.text, 'html.parser')

tables = soup.find_all("table")  # Collect every <table> element on the page
print('Total Tables:', len(tables))

Total Tables: 2


### 4 - Identify the right Table

Evaluated each table by its row count and its widest row of header cells to identify the most relevant one. Table 1 was selected for its structure and completeness.

In [20]:
# Find the right table by counting rows and header cells for each table

cntr = 0
num_cols = 0  # Not reset per table, so this is a running maximum across tables

for data in tables:
    cntr += 1  # Increment the counter to display the table index
    rows = data.find_all('tr')  # Each <tr> is one row

    for row in rows:  # Loop through each row to count its header cells
        cols = row.find_all(['th'])  # Only <th> cells are counted, not <td>
        num_cols = max(num_cols, len(cols))  # Keep the largest count seen so far

    print(f'Table {cntr}: \n No of rows: {len(rows)} \n No of columns: {num_cols}')

Table 1: 
 No of rows: 221 
 No of columns: 6
Table 2: 
 No of rows: 14 
 No of columns: 6
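Because `num_cols` in the loop above is initialised once, the column count reported for Table 2 can never be lower than the count for Table 1. A sketch of a per-table count follows, run on a toy HTML string rather than the live page. The helper name `table_shapes` and the sample markup are assumptions made for illustration, and the helper counts both `<th>` and `<td>` cells.

```python
from bs4 import BeautifulSoup


def table_shapes(html):
    """Return a (row_count, max_cells_per_row) pair for each <table>."""
    soup = BeautifulSoup(html, "html.parser")
    shapes = []
    for table in soup.find_all("table"):
        rows = table.find_all("tr")
        # Computed fresh for every table, so one wide table cannot inflate the next.
        widest = max((len(row.find_all(["th", "td"])) for row in rows), default=0)
        shapes.append((len(rows), widest))
    return shapes


# Toy markup (hypothetical): a 3-column table followed by a 1-column table.
sample = (
    "<table><tr><th>A</th><th>B</th><th>C</th></tr>"
    "<tr><td>1</td><td>2</td><td>3</td></tr></table>"
    "<table><tr><th>X</th></tr><tr><td>9</td></tr><tr><td>8</td></tr></table>"
)

for index, (n_rows, n_cols) in enumerate(table_shapes(sample), start=1):
    print(f"Table {index}: rows={n_rows}, columns={n_cols}")
```

Passing `page.text` instead of `sample` would apply the same count to the Wikipedia page. The figures it reports may differ from the output above.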


### 5 - Extract the right Table

Extracted data from Table 1 and converted it into a pandas DataFrame for further analysis.

In [21]:
# Identify table 1
target_table = tables[0]  # 0-based index

# Extract headers
headers = [th.get_text(strip=True) for th in target_table.find_all('tr')[0].find_all('th')]

# Extract all rows
rows = []
for tr in target_table.find_all('tr')[1:]:
    cells = tr.find_all(['td', 'th'])
    row = [cell.get_text(strip=True) for cell in cells]
    if row and len(row) == len(headers):  # Ensure the row matches column count
        rows.append(row)

# Create DataFrame
df = pd.DataFrame(rows, columns=headers)

# Show preview
print(df.head())

     Location Arable m²/ person Persons /arable km2 %arable Arableland (km2)  \
0       World             1,800                 570     11%       14,000,000   
1  Kazakhstan            15,456                  65     11%          296,697   
2   Australia            12,062                  83      4%          312,650   
3      Canada            10,027                 100      4%          382,590   
4   Argentina             9,322                 107     15%          422,088   

      Population  
0  7,900,000,000  
1     19,196,465  
2     25,921,089  
3     38,155,012  
4     45,276,780  
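The extraction loop silently drops any row whose cell count differs from the header count. A sketch of a variant that also collects the dropped rows for inspection follows. The helper name `extract_rows` and the toy table are illustrative assumptions.

```python
from bs4 import BeautifulSoup


def extract_rows(table, n_headers):
    """Split a table's body rows into kept and skipped lists."""
    kept, skipped = [], []
    for tr in table.find_all("tr")[1:]:  # Skip the header row
        row = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if not row:
            continue  # Ignore rows without any cells
        # Rows with an unexpected cell count are set aside instead of discarded.
        (kept if len(row) == n_headers else skipped).append(row)
    return kept, skipped


# Toy table (hypothetical): one regular row and one spanning footnote row.
sample = """
<table>
  <tr><th>Location</th><th>%arable</th></tr>
  <tr><td>Kazakhstan</td><td>11%</td></tr>
  <tr><td colspan="2">Footnote row</td></tr>
</table>
"""
table = BeautifulSoup(sample, "html.parser").find("table")
kept, skipped = extract_rows(table, n_headers=2)
print(f"kept {len(kept)} rows, skipped {len(skipped)} rows")
```

Inspecting `skipped` for `target_table` would show whether any country rows were lost to merged cells.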


### 5 - Rename Column Titles

Renamed the columns with more descriptive and readable labels to enhance clarity.

In [22]:
# Renaming columns
new_names = {'Location': 'Area', 'Arable m²/ person': 'Arable per person',
             'Persons /arable km2': 'Person per Arable', '%arable': 'Arable%',
             'Arableland (km2)': 'Total Arable'}
df.rename(columns=new_names, inplace=True)

df.head(10)

|  | Area | Arable per person | Person per Arable | Arable% | Total Arable | Population |
|---|---|---|---|---|---|---|
| 0 | World | 1800 | 570 | 11% | 14000000 | 7900000000 |
| 1 | Kazakhstan | 15456 | 65 | 11% | 296697 | 19196465 |
| 2 | Australia | 12062 | 83 | 4% | 312650 | 25921089 |
| 3 | Canada | 10027 | 100 | 4% | 382590 | 38155012 |
| 4 | Argentina | 9322 | 107 | 15% | 422088 | 45276780 |
| 5 | Russia | 8384 | 119 | 7% | 1216490 | 145102755 |
| 6 | Lithuania | 8178 | 122 | 36% | 22790 | 2786651 |
| 7 | Ukraine | 7563 | 132 | 57% | 329240 | 43531422 |
| 8 | Saint Helena, Ascension and Tristan da Cunha | 7402 | 135 | 10% | 40 | 5404 |
| 9 | Latvia | 7268 | 138 | 22% | 13620 | 1873919 |
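Header text scraped from a live page can change when the page is edited, and `DataFrame.rename` by default ignores mapping keys that match no column. A small sketch, using a toy frame rather than the scraped one, shows how `errors="raise"` surfaces such a mismatch. The frame `df_demo` and the misspelled key are hypothetical.

```python
import pandas as pd

# Toy frame with two of the original column names from the scraped table.
df_demo = pd.DataFrame({"Location": ["Kazakhstan"], "%arable": ["11%"]})

mapping = {"Location": "Area", "%arable": "Arable%"}
# errors="raise" makes rename fail if a key is not an existing column.
renamed = df_demo.rename(columns=mapping, errors="raise")
print(list(renamed.columns))

try:
    # Deliberately misspelled key to show the failure mode.
    df_demo.rename(columns={"Populaton": "Population"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```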


### 6 - Remove Unwanted columns

Focused the dataset on crop production and land usage by removing columns related to population statistics, which are outside the scope of this analysis.

In [None]:
# Drop the per-person and population columns
columns_to_drop = ['Arable per person', 'Person per Arable', 'Population']
df = df.drop(columns=columns_to_drop)

df.head(-5)  # All rows except the last five

|  | Area | Arable% | Total Arable |
|---|---|---|---|
| 0 | World | 11% | 14000000 |
| 1 | Kazakhstan | 11% | 296697 |
| 2 | Australia | 4% | 312650 |
| 3 | Canada | 4% | 382590 |
| 4 | Argentina | 15% | 422088 |
| ... | ... | ... | ... |
| 210 | Bermuda | 6% | 3 |
| 211 | Cayman Islands | 0.8% | 2 |
| 212 | Djibouti | 0.1% | 30 |
| 213 | Kuwait | 0.4% | 80 |


### 7 - Remove Unwanted data

Removed aggregate "World" data to narrow the focus to individual countries, aligning with the objective of country-specific analysis.

In [26]:
# Drop the aggregate "World" record

i = df[df.Area == 'World'].index  # Index label(s) of the World row
df.drop(i, axis=0, inplace=True)

df.head(-5)  # All rows except the last five

|  | Area | Arable% | Total Arable |
|---|---|---|---|
| 1 | Kazakhstan | 11% | 296697 |
| 2 | Australia | 4% | 312650 |
| 3 | Canada | 4% | 382590 |
| 4 | Argentina | 15% | 422088 |
| 5 | Russia | 7% | 1216490 |
| ... | ... | ... | ... |
| 210 | Bermuda | 6% | 3 |
| 211 | Cayman Islands | 0.8% | 2 |
| 212 | Djibouti | 0.1% | 30 |
| 213 | Kuwait | 0.4% | 80 |
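The same removal can be written as a boolean-mask filter, followed by two cheap sanity checks. This is a sketch on a toy frame built from three rows of the table, not the notebook's own method. The name `df_demo` and the `reset_index` call are assumptions, and the original code keeps the index starting at 1.

```python
import pandas as pd

# Toy rows based on the scraped table (values kept as strings, as scraped).
df_demo = pd.DataFrame(
    {
        "Area": ["World", "Kazakhstan", "Australia"],
        "Arable%": ["11%", "11%", "4%"],
        "Total Arable": ["14,000,000", "296,697", "312,650"],
    }
)

# Keep every row whose Area is not the aggregate "World" entry.
countries = df_demo[df_demo["Area"] != "World"].reset_index(drop=True)

# Sanity checks: the aggregate is gone and no area appears twice.
assert "World" not in countries["Area"].values
assert not countries["Area"].duplicated().any()
print(countries)
```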


The Wikipedia dataset is already cleaned, well-structured, and dated 2021, so duplicate checks, outlier detection, and fuzzy matching are not needed at this stage. Given the context and clarity of the data, no further transformations are applied here, although more may be required if this dataset is later joined with others.
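If the dataset is later joined with others or used for arithmetic, note that the scraped values are strings (for example `11%` and `296,697` in the preview from the extraction step). A minimal sketch of a numeric conversion on a toy frame follows. The new column names `arable_pct` and `total_arable_km2` are hypothetical.

```python
import pandas as pd

# Toy rows with values in the form printed by the extraction step.
df_demo = pd.DataFrame(
    {
        "Area": ["Kazakhstan", "Cayman Islands"],
        "Arable%": ["11%", "0.8%"],
        "Total Arable": ["296,697", "2"],
    }
)

numeric = df_demo.assign(
    # Strip the trailing percent sign, then parse as a number.
    arable_pct=pd.to_numeric(df_demo["Arable%"].str.rstrip("%")),
    # Remove thousands separators, then parse as a number.
    total_arable_km2=pd.to_numeric(
        df_demo["Total Arable"].str.replace(",", "", regex=False)
    ),
)
print(numeric.dtypes)
```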

### Ethical Considerations in Wrangling Open-Source Agricultural Data from Wikipedia

The data wrangling process for this project retrieved publicly available data from Wikipedia, parsed the tabular information with BeautifulSoup, and converted it into a structured pandas DataFrame. Key transformations included selecting the most comprehensive table (Table 1), renaming column headers for clarity, and narrowing the dataset by removing non-country-specific or unrelated information (the aggregate “World” row and the population metrics).

Wikipedia is an openly editable, collaboratively maintained platform. It is generally reliable, but the credibility of its data depends on the quality and accuracy of the underlying sources, which are usually cited in the references. No private or personally identifiable information was used, so there are no major legal restrictions under common data regulations such as GDPR.

The primary ethical risk lies in making analytical decisions based on potentially outdated or user-modified content. Assumptions made during cleaning, such as interpreting column meanings and removing certain rows, could influence analysis outcomes if not transparently documented. To mitigate these concerns, the data should be cross-referenced with official sources such as the UN or World Bank, and every assumption and transformation should be clearly documented to maintain transparency throughout the project.
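One lightweight way to document the transformations is to store a small provenance record next to the cleaned data. The sketch below is only an illustration. The field names are assumptions, and `retrieved_on` should hold the actual retrieval date, not the date the cell is run.

```python
import json
from datetime import date

# Hypothetical provenance record; field names are illustrative.
provenance = {
    "source_url": "https://en.wikipedia.org/wiki/List_of_countries_by_arable_land_density",
    "retrieved_on": date.today().isoformat(),  # Placeholder: use the real retrieval date
    "transformations": [
        "selected Table 1",
        "renamed columns",
        "dropped Arable per person, Person per Arable, Population",
        "dropped the aggregate World row",
    ],
}
print(json.dumps(provenance, indent=2))
```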