<img src="logo.png" alt="Image Description" width="200" height="35" />

# <span style ='color : black'> **STAGE A**: Introduction to Python for Machine Learning </span>

 ## <span style ='color : black'> Graded Quiz </span>
> The [dataset](https://github.com/HamoyeHQ/HDSC-Introduction-to-Python-for-machine-learning) is a compilation of agricultural information provided by the **Food and Agriculture Organization** (*FAO*) of the **United Nations**. It covers a wide range of data points on agriculture, food production, and consumption for the years 2014 to 2018.


![Food Sample](sample.png)
*Image by <a href="https://www.freepik.com/free-photo/vegetable-with-space-bottom_1298891.htm#query=Food%20and%20Agriculture%20header&position=27&from_view=search&track=ais">Freepik</a>*

In [1]:
# import required libraries
import pandas as pd
import numpy as np

In [2]:
# load the data
food_df = pd.read_csv("FoodBalanceSheets_E_Africa_NOFLAG.CSV", encoding = 'latin-1')
food_df.head(3)

|  | Area Code | Area | Item Code | Item | Element Code | Element | Unit | Y2014 | Y2015 | Y2016 | Y2017 | Y2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | Algeria | 2501 | Population | 511 | Total Population - Both sexes | 1000 persons | 38924.0 | 39728.0 | 40551.0 | 41389.0 | 42228.0 |
| 1 | 4 | Algeria | 2501 | Population | 5301 | Domestic supply quantity | 1000 tonnes | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 4 | Algeria | 2901 | Grand Total | 664 | Food supply (kcal/capita/day) | kcal/capita/day | 3377.0 | 3379.0 | 3372.0 | 3341.0 | 3322.0 |


In [3]:

# display the DataFrame shape
print('================================================================')
print(f'DataFrame shape: {food_df.shape}\n')

# display the DataFrame info (`info()` prints directly and returns None)
print('================================================================')
food_df.info()


DataFrame shape: (60943, 12)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60943 entries, 0 to 60942
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Area Code     60943 non-null  int64  
 1   Area          60943 non-null  object 
 2   Item Code     60943 non-null  int64  
 3   Item          60943 non-null  object 
 4   Element Code  60943 non-null  int64  
 5   Element       60943 non-null  object 
 6   Unit          60943 non-null  object 
 7   Y2014         59354 non-null  float64
 8   Y2015         59395 non-null  float64
 9   Y2016         59408 non-null  float64
 10  Y2017         59437 non-null  float64
 11  Y2018         59507 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 5.6+ MB



In [4]:
# display the DataFrame description
food_df.describe()

|  | Area Code | Item Code | Element Code | Y2014 | Y2015 | Y2016 | Y2017 | Y2018 |
|---|---|---|---|---|---|---|---|---|
| count | 60943.0 | 60943.0 | 60943.0 | 59354.0 | 59395.0 | 59408.0 | 59437.0 | 59507.0 |
| mean | 134.265576 | 2687.176706 | 3814.856456 | 134.196282 | 135.235966 | 136.555222 | 140.917765 | 143.758381 |
| std | 72.605709 | 146.055739 | 2212.007033 | 1567.663696 | 1603.403984 | 1640.007194 | 1671.862359 | 1710.782658 |
| min | 4.0 | 2501.0 | 511.0 | -1796.0 | -3161.0 | -3225.0 | -1582.0 | -3396.0 |
| 25% | 74.0 | 2562.0 | 684.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 136.0 | 2630.0 | 5142.0 | 0.09 | 0.08 | 0.08 | 0.1 | 0.07 |
| 75% | 195.0 | 2775.0 | 5511.0 | 8.34 | 8.46 | 8.43 | 9.0 | 9.0 |
| max | 276.0 | 2961.0 | 5911.0 | 176405.0 | 181137.0 | 185960.0 | 190873.0 | 195875.0 |


`food_df` contains information on various key factors, including geographical details, specific agricultural items, production elements, and measurement units. The dataset is structured around the following features:

1. **Area Code**: A numerical code representing specific geographical regions.
2. **Area**: The name of the geographical area.
3. **Item Code**: A numerical code representing specific agricultural items or products.
4. **Item**: The name of the agricultural item or product.
5. **Element Code**: A numerical code representing specific agricultural elements or indicators.
6. **Element**: The description of the agricultural element or indicator.
7. **Unit**: The measurement unit used for the data.
8. **Y2014, Y2015, Y2016, Y2017, Y2018**: Data columns containing information for each respective year.

In [5]:
# copy DataFrame
copy_df = food_df.copy()

---

### <span style ='color : green'> **Question 1** </span>
What is the total number of **unique countries** in the dataset?

<span style ='color : orange'> **Ans:** </span> 

In [6]:
# calculate number of unique values
print(f"Number of unique countries in the dataset : {copy_df['Area'].nunique()}")

Number of unique countries in the dataset : 49


### <span style ='color : green'> **Question 2** </span>
Consider the following list of tuples:

`y = [(2, 4), (7, 8), (1, 5, 9)]`

How would you **assign element 8** from the list to a variable `x`?

<span style ='color : orange'> **Ans:** </span> 

In [7]:
y = [(2, 4), (7, 8), (1, 5, 9)]
# assign element `8` to `x`: second tuple, last position
x = y[1][-1]
print(x)
# equivalent, using a positive index
x = y[1][1]
print(x)

8
8


### <span style ='color : green'> **Question 3** </span>
What is the **total** `Protein supply quantity` in Madagascar in **2015**?

<span style ='color : orange'> **Ans:** </span>

In [8]:
# add Madagascar filter to Area
mg = copy_df[copy_df['Area'] == 'Madagascar']
# groupby sum aggregation on `Element`
mg = mg.groupby('Element')['Y2015'].sum()
# locate `Protein supply quantity`
protein = mg.loc['Protein supply quantity (g/capita/day)']
print(f"Total protein supply quantity in Madagascar in 2015: {protein} g/capita/day")

Total protein supply quantity in Madagascar in 2015: 173.05 g/capita/day


### <span style ='color : green'> **Question 4** </span>
What would be the **output** of the following?

`S = [['him', 'sell'], [90, 28, 43]]`

`S[0][1][1]`

<span style ='color : orange'> **Ans:** </span>

In [9]:
S = [['him', 'sell'], [90, 28, 43]]

print(f"Output : {S[0][1][1]}")

Output : e
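
The chained index can be read one bracket at a time. The sketch below unpacks it into intermediate variables; the names `first_list`, `second_word`, and `second_char` are introduced here only for illustration.

```python
S = [['him', 'sell'], [90, 28, 43]]

first_list = S[0]             # first inner list: ['him', 'sell']
second_word = first_list[1]   # second item of that list: 'sell'
second_char = second_word[1]  # second character of that string: 'e'

print(f"Output : {second_char}")
```

Strings are sequences too, which is why the third index works on `'sell'`.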


### <span style ='color : green'> **Question 5** </span>
What is the **total number and percentage** of missing data in **2014** to 3 decimal places?

<span style ='color : orange'> **Ans:** </span> 

In [10]:
# number of missing values in `Y2014`
missing = copy_df['Y2014'].isnull().sum()

# total number of rows
n_rows = copy_df.shape[0]

print("Missing Values in `Y2014`")
print("==========================")
print(f"Total number : {missing}")
print(f"Percentage : {round((missing/n_rows)*100, 3)}%")

Missing Values in `Y2014`
Total number : 1589
Percentage : 2.607%
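
As a possible shortcut, the mean of a boolean mask gives the missing fraction directly, because `True` counts as 1. The sketch below uses a small made-up Series (the values are hypothetical, not taken from the FAO data) so it can run on its own.

```python
import numpy as np
import pandas as pd

# Toy column with 2 missing entries out of 8 (hypothetical values)
toy = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0], name="Y2014")

missing_count = toy.isna().sum()                  # number of NaN entries
missing_pct = round(toy.isna().mean() * 100, 3)   # share of NaN entries, in percent

print(f"Total number : {missing_count}")
print(f"Percentage : {missing_pct}%")
```

Applied to the real column, the same expression would be `copy_df['Y2014'].isna().mean() * 100`.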


### <span style ='color : green'> **Question 6** </span>
Select columns `Y2017` and `Area`, then perform a groupby operation on `Area`. Which of these areas had the **7th lowest sum in 2017**?

<span style ='color : orange'> **Ans:** </span> 

In [11]:
# groupby sum aggregation on `Area`
areas_2017 = copy_df.groupby('Area')['Y2017'].sum()
# sort in ascending order and keep the 7 lowest sums
sorted_areas = areas_2017.sort_values()[:7]

print(f"Area with the 7th lowest sum in 2017 : {sorted_areas.index[6]}")

Area with the 7th lowest sum in 2017 : Guinea-Bissau
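
An alternative to sorting and slicing is `Series.nsmallest`, which returns the `n` smallest values in ascending order. The sketch below assumes a toy Series of totals (hypothetical labels and numbers) and picks the third-lowest entry to keep the example short.

```python
import pandas as pd

# Toy totals per area (hypothetical labels and values)
totals = pd.Series({"A": 50.0, "B": 10.0, "C": 30.0, "D": 20.0, "E": 40.0})

# Take the 3 smallest totals; the last of them is the 3rd lowest
third_lowest = totals.nsmallest(3).index[-1]

print(f"Area with the 3rd lowest sum : {third_lowest}")
```

For the quiz question, the analogous call would be `areas_2017.nsmallest(7).index[-1]`.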


### <span style ='color : green'> **Question 7** </span>
Given the following numpy array 

array  = ([[94, ***89, 63***],
            [93, ***92, 48***],
            [92, 94, 56]])

How would you **select the elements in bold and italics** from the array?

<span style ='color : orange'> **Ans:** </span>

In [12]:
array = np.array([[94, 89, 63],
                  [93, 92, 48],
                  [92, 94, 56]])

array[:2, 1:]

array([[89, 63],
       [92, 48]])

### <span style ='color : green'> **Question 8** </span>
What is the **mean** and **standard deviation** across the whole dataset for the year **2017** to 2 decimal places?

<span style ='color : orange'> **Ans:** </span>

In [13]:
# calculate mean to 2 d.p
avg = round(copy_df['Y2017'].mean(),2)
# calculate std to 2 d.p
stdev = round(copy_df['Y2017'].std(),2)

print(f"Y2017 Mean: {avg}")
print(f"Y2017 Standard Deviation : {stdev}")

Y2017 Mean: 140.92
Y2017 Standard Deviation : 1671.86


### <span style ='color : green'> **Question 9** </span>
Which of these Python data structures is **unordered**?

<span style ='color : orange'> **Ans:** </span> A `set` is unordered.
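
A quick way to see this: a set stores no positions, so it drops duplicates and cannot be indexed. The snippet below is a minimal illustration; the variable names are made up for the example.

```python
letters = {"c", "a", "b", "a"}  # the duplicate "a" collapses

print(len(letters))

try:
    letters[0]  # sets have no positions, so indexing fails
except TypeError as err:
    error_message = str(err)
    print(f"TypeError: {error_message}")

# Converting to a sorted list gives a well-defined order that can be indexed
as_list = sorted(letters)
print(as_list[0])
```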

### <span style ='color : green'> **Question 10** </span>
Perform a groupby operation on `Element`.  What year has the **highest sum** of `Stock Variation`?

<span style ='color : orange'> **Ans:** </span>

In [14]:
# groupby sum aggregation on `Element`
elements = copy_df.groupby('Element')[copy_df.iloc[:, -5:].columns].sum()
# locate `Stock Variation`
stock_v = elements.loc['Stock Variation'].sort_values(ascending=False)

print(f"Year that has the highest sum of `Stock Variation` : {stock_v.index[0]}")

Year that has the highest sum of `Stock Variation` : Y2014


### <span style ='color : green'> **Question 11** </span>
Which of the following is a **Python built-in module**?


<span style ='color : orange'> **Ans:** </span> `math`
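
The `math` module ships with the Python standard library, so it can be imported without installing anything, unlike third-party packages such as pandas and NumPy. A small illustration:

```python
import math  # part of the standard library; no installation needed

root = math.sqrt(16)       # square root, returned as a float
floored = math.floor(2.7)  # largest integer <= 2.7

print(root)
print(floored)
```

Note that the module name is lowercase; `import Math` would raise `ModuleNotFoundError`.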

### <span style ='color : green'> **Question 12** </span>
If you have the following list

`lst = [[35, 'Portugal', 94], [33, 'Argentina', 93], [30 , 'Brazil', 92]]`

`col = ['Age','Nationality','Overall']`

How do you **create a pandas DataFrame** using this list, to look like the table below?

|  |Age|Nationality| Overall|
|---|---|---|---|
| 1 | 35 | Portugal | 94 |
| 2 | 33 | Argentina | 93 |
| 3 | 30 | Brazil | 92 |


<span style ='color : orange'> **Ans:** </span> 

In [15]:
lst = [[35, 'Portugal', 94], [33, 'Argentina', 93], [30 , 'Brazil', 92]]
col = ['Age','Nationality','Overall']
pd.DataFrame(lst, columns=col, index=range(1, 4))

|  | Age | Nationality | Overall |
|---|---|---|---|
| 1 | 35 | Portugal | 94 |
| 2 | 33 | Argentina | 93 |
| 3 | 30 | Brazil | 92 |


### <span style ='color : green'> **Question 13** </span>
Which year had the **least correlation** with `Element Code`?

<span style ='color : orange'> **Ans:** </span>

In [16]:
# year columns
years = copy_df.columns[7:]
# calculate the correlation between `Element Code` and each year
cor_by_year = {}
for year in years:
    cor_by_year[year] = round(copy_df['Element Code'].corr(copy_df[year]), 4)

# year with the smallest correlation
least_year = min(cor_by_year, key=cor_by_year.get)

print(cor_by_year)
print("====================================================================")
print(f"Year that has the least correlation with `Element Code` is {least_year} with a corr of {cor_by_year[least_year]}")

{'Y2014': 0.0245, 'Y2015': 0.0239, 'Y2016': 0.0234, 'Y2017': 0.0243, 'Y2018': 0.0243}
Year that has the least correlation with `Element Code` is Y2016 with a corr of 0.0234


### <span style ='color : green'> **Question 14** </span>
Select columns `Y2017` and `Area`, then perform a groupby operation on `Area`. Which of these areas had the **highest sum in 2017**?

<span style ='color : orange'> **Ans:** </span>

In [17]:
# groupby sum aggregation on `Area`
areas = copy_df.groupby('Area')['Y2017'].sum()
# sort values from the highest
highest_area = areas.sort_values(ascending=False)


print(f"Area: {highest_area.index[0]}")
print(f"Sum: {highest_area.iloc[0]}")

Area: Nigeria
Sum: 1483268.23


### <span style ='color : green'> **Question 15** </span>
Which of the following **DataFrame methods** can be used to access elements across rows and columns?

<span style ='color : orange'> **Ans:** </span> `df.iloc[ : ]` 
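
As a brief illustration, `iloc` selects by integer position while `loc` selects by label. The sketch below reuses the small table from Question 12; the variable names `by_position`, `by_label`, and `block` are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[35, "Portugal", 94], [33, "Argentina", 93], [30, "Brazil", 92]],
    columns=["Age", "Nationality", "Overall"],
    index=[1, 2, 3],
)

# iloc uses integer positions: first row, second column
by_position = df.iloc[0, 1]

# loc uses labels: index label 1, column "Nationality"
by_label = df.loc[1, "Nationality"]

# iloc also slices across rows and columns: first two rows, last two columns
block = df.iloc[:2, 1:]

print(by_position, by_label)
print(block)
```

Both lookups return the same cell here because index label `1` sits at position `0`.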

### <span style ='color : green'> **Question 16** </span>
A pandas DataFrame with dimensions `(100,3)` has how many **features** and **observations**?

<span style ='color : orange'> **Ans:** </span> 
(100, 3) - 3 features, 100 observations
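
The shape tuple reads as `(rows, columns)`, so it can be unpacked directly into observations and features. A minimal sketch, assuming a placeholder DataFrame of zeros with made-up column names:

```python
import numpy as np
import pandas as pd

# Placeholder data: 100 rows, 3 columns (column names are hypothetical)
df = pd.DataFrame(np.zeros((100, 3)), columns=["f1", "f2", "f3"])

n_observations, n_features = df.shape  # (rows, columns)

print(f"{n_features} features, {n_observations} observations")
```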

### <span style ='color : green'> **Question 17** </span>
How would you check for the **number of rows and columns** in a pandas DataFrame named `df`?

<span style ='color : orange'> **Ans:** </span> `df.shape`

### <span style ='color : green'> **Question 18** </span>
Given the following Python code, what would the **output** be?

`my_tuppy = (1,2,5,8)`

`my_tuppy[2] = 6`

In [18]:
my_tuppy = (1,2,5,8)

my_tuppy[2] = 6

TypeError: 'tuple' object does not support item assignment

<span style ='color : orange'> **Ans:** </span> <span style ='color : red'> **TypeError** </span>
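
Tuples are immutable, so the usual workaround is to build a new tuple rather than edit the existing one. The sketch below catches the error and then shows one possible way to get the intended result; `updated` is a name chosen for the example.

```python
my_tuppy = (1, 2, 5, 8)

try:
    my_tuppy[2] = 6  # item assignment is not allowed on tuples
except TypeError as err:
    message = str(err)
    print(f"TypeError: {message}")

# Build a new tuple from slices of the old one plus the replacement value
updated = my_tuppy[:2] + (6,) + my_tuppy[3:]
print(updated)
```

Converting to a list, changing the item, and converting back with `tuple(...)` would work as well.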

### <span style ='color : green'> **Question 19** </span>
Perform a groupby operation on `Element`. What is the **total sum** of `Processing` in **2017**?

<span style ='color : orange'> **Ans:** </span>

In [19]:
# groupby sum aggregation on `Element`
elements = copy_df.groupby('Element')['Y2017'].sum()
# locate 'Processing'
processing = elements.loc['Processing']

print(f"Sum of Processing in 2017 : {processing}")

Sum of Processing in 2017 : 292836.0


### <span style ='color : green'> **Question 20** </span>
What is the **total sum** of `Wine produced` in **2015** and **2018** respectively?

*Hint*:*Perform a groupby sum aggregation on* `Item`

<span style ='color : orange'> **Ans:** </span>

In [20]:
# groupby sum aggregation on `Item`
items = copy_df.groupby('Item')[['Y2015','Y2018']].sum()
# locate `Wine`
wine = items.loc['Wine']

print("Total sum of wine produced in: ")
print("==============================")
print(f"Y2015 - {wine['Y2015']}")
print(f"Y2018 - {wine['Y2018']}")

Total sum of wine produced in: 
Y2015 - 4251.81
Y2018 - 4039.32
