# Month 2 - exams

In [33]:
import pandas as pd
import numpy as numpy
from scipy import stats
import os

## Question1 - Pandas for Data Analysis

### Instructions
You are required to perform the following data wrangling tasks using Pandas and NumPy.

Load the dataset directly from the GitHub link provided: https://raw.githubusercontent.com/ek-chris/Practice_datasets/refs/heads/main/eletronic_sales.csv

Write clean and efficient Python code for each question.

Ensure each solution outputs a well-structured DataFrame or value as required.

Your submission should be made as a Jupyter Notebook (.ipynb) file.

Include both your code and outputs for every question.

Ensure your notebook is clearly organized and well-commented.


### a. Data Loading, Datetime Conversion and Feature Extraction (6 marks)
Load the dataset from the GitHub link. Convert the Date column to datetime format, and create new columns for Year, Month, Day, and Day_of_Week.

In [26]:
url = "https://raw.githubusercontent.com/ek-chris/Practice_datasets/refs/heads/main/eletronic_sales.csv"
df = pd.read_csv(url)

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Day_of_Week'] = df['Date'].dt.day_name()
df.head(2)

        Date Branch Sales Agent Products  Units  Price  Year  Month  Day Day_of_Week
0 2014-09-01   Woji     Chinedu    Apple      2  125.0  2014      9    1      Monday
1 2015-06-17   Woji       Emeka    Apple      5  125.0  2015      6   17   Wednesday
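Because `errors='coerce'` silently turns unparseable dates into `NaT`, it can be worth counting them before deriving features. The sketch below uses a small made-up sample (the `sample` frame and its values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical three-row sample; the last value is deliberately invalid
sample = pd.DataFrame({"Date": ["2014-09-01", "2015-06-17", "not a date"]})

# Same conversion as above: invalid entries become NaT instead of raising
sample["Date"] = pd.to_datetime(sample["Date"], errors="coerce")

# Count rows that failed to parse
n_unparsed = int(sample["Date"].isna().sum())

# Derived features are missing wherever the date is NaT
day_names = sample["Date"].dt.day_name()

print("Unparsed dates:", n_unparsed)
print(day_names.tolist())
```

If the count is non-zero on the real data, the derived Year, Month, Day, and Day_of_Week columns will contain missing values for those rows.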


### b. Branch-Level Total Sales (6 marks)
Calculate the total sales for each branch, where Total Sales = Units × Price. Return a new DataFrame showing Branch and Total_Sales.

In [10]:
# Calculate Total_Sales per branch and return a new DataFrame
branch_total_sales = (
	df.assign(Total_Sales=df['Units'] * df['Price'])
	  .groupby('Branch', as_index=False)['Total_Sales']
	  .sum()
	  .sort_values('Total_Sales', ascending=False)
)

branch_total_sales



  Branch  Total_Sales
2   Woji     11139.07
0    GRA      6002.09
1   Town      2486.72
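A quick sanity check for a grouped total is that the group sums add up to the overall sum. A minimal sketch on made-up rows (the `sample` frame is a hypothetical stand-in for the dataset):

```python
import pandas as pd

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "Branch": ["Woji", "Woji", "GRA"],
    "Units": [2, 5, 3],
    "Price": [125.0, 125.0, 4.99],
})

# Row-level sales, then the per-branch sum
line_total = sample["Units"] * sample["Price"]
by_branch = line_total.groupby(sample["Branch"]).sum()

# Branch totals must add up to the overall total
assert abs(by_branch.sum() - line_total.sum()) < 1e-9

print(by_branch)
```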


### c. Top Performing Sales Agent (6 marks)
Determine the top-performing sales agent based on total sales across all branches. Display both the agent’s name and their total sales amount.

In [16]:
# Top-performing sales agent by total sales across all branches
agent_total_sales = (
	df.assign(Total_Sales=df['Units'] * df['Price'])
	  .groupby('Sales Agent', as_index=False)['Total_Sales']
	  .sum()
	  .sort_values('Total_Sales', ascending=False)
)

# Display the five highest totals; the first row is the top performer
agent_total_sales.head()

  Sales Agent  Total_Sales
3       Emeka      3109.44
2      Chioma      3102.30
6        Tolu      2812.19
0    Blessing      2363.04
5     Ibrahim      1749.87
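The question asks for the agent's name and total specifically. One way to pull out just that pair is `idxmax`, sketched here on the top three totals copied from the output above (the `agent_totals` Series is rebuilt by hand for illustration):

```python
import pandas as pd

# Top three totals copied from the output above
agent_totals = pd.Series(
    {"Emeka": 3109.44, "Chioma": 3102.30, "Tolu": 2812.19},
    name="Total_Sales",
)

# idxmax returns the index label (agent name) of the largest value
top_agent = agent_totals.idxmax()
top_total = agent_totals.max()

print(f"Top agent: {top_agent} ({top_total:.2f})")
```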


### d. Introducing and Filling Missing Values (6 marks)
Using NumPy, introduce missing values in the Price column for rows 5, 15, and 25. After that, fill the missing values using the median of the Price column.

In [27]:
# Introduce NaNs in Price for rows 5, 15, 25
df.loc[[5, 15, 25], 'Price'] = numpy.nan

# Compute median and fill missing values
price_median = df['Price'].median()
df['Price'] = df['Price'].fillna(price_median)  # assign back instead of chained inplace fillna

# Confirm changes
print("Filled median:", price_median)
df.loc[[5, 15, 25], ['Price']]


Filled median: 4.99


    Price
5    4.99
15   4.99
25   4.99
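The median used for filling is computed from the non-missing prices only, since `Series.median()` skips `NaN` by default. A small sketch with made-up prices (the values are assumptions chosen for easy arithmetic):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries
prices = pd.Series([125.0, 4.99, np.nan, 1.99, np.nan, 19.99])

# NaN values are ignored: median of [1.99, 4.99, 19.99, 125.0]
median = prices.median()

# fillna returns a new Series; assign it back to keep the result
filled = prices.fillna(median)

print(f"Median of observed prices: {median:.2f}")
print(filled.tolist())
```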


### e. Product-Level Summary (6 marks)
Generate a summary DataFrame that shows each product’s average price and total units sold.

In [28]:
# Product-level summary: average price and total units sold
product_summary = (
	df.groupby('Products', as_index=False)
	  .agg(Average_Price=('Price', 'mean'),
		   Total_Units_Sold=('Units', 'sum'))
)

# Round average price and sort by total units sold
product_summary['Average_Price'] = product_summary['Average_Price'].round(2)
product_summary = product_summary.sort_values('Total_Units_Sold', ascending=False).reset_index(drop=True)

product_summary

  Products  Average_Price  Total_Units_Sold
0       HP          11.52               722
1   Lenovo           3.01               716
2     Dell          11.91               395
3   Compaq           5.19               278
4    Apple         175.00                10


## Question2 - NumPy for numeric computation

### Instructions
You are required to perform the following data analysis and manipulation tasks using NumPy.

Simulate or generate arrays as instructed in each question.

Write clean, well-commented, and efficient Python code for each solution.

Ensure that each output is properly displayed and easy to interpret.

Each question carries 2 marks, for a total of 10 marks.

Include both your code and outputs for every question.

Follow consistent formatting and clear naming conventions for variables.

### a. Array Creation and Basic Manipulation (6 marks)
Create a NumPy array containing 20 random integers between 10 and 100. Then perform the following tasks:

Reshape the array into a 4×5 matrix.

Extract the first two rows and last three columns from the reshaped array.

Compute the mean and standard deviation of the entire array.

In [29]:
# Create 20 random integers between 10 and 100 (inclusive)
numpy.random.seed(0)  # for reproducibility
arr = numpy.random.randint(10, 101, size=20)

# Reshape into a 4x5 matrix
arr_reshaped = arr.reshape(4, 5)

# Extract the first two rows and last three columns
subarray = arr_reshaped[:2, -3:]

# Compute mean and standard deviation of the entire array
mean_val = arr.mean()
std_val = arr.std()

# Display results
print("Original 1D array:", arr)
print("\nReshaped (4x5):\n", arr_reshaped)
print("\nFirst two rows, last three columns:\n", subarray)
print(f"\nMean: {mean_val:.2f}")
print(f"Standard Deviation: {std_val:.2f}")

Original 1D array: [54 57 74 77 77 19 93 31 46 97 80 98 98 22 68 75 49 97 56 98]

Reshaped (4x5):
 [[54 57 74 77 77]
 [19 93 31 46 97]
 [80 98 98 22 68]
 [75 49 97 56 98]]

First two rows, last three columns:
 [[74 77 77]
 [31 46 97]]

Mean: 68.30
Standard Deviation: 25.09
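Note that `arr.std()` defaults to the population formula (`ddof=0`). If the 20 values were treated as a sample, `ddof=1` would apply instead. A short sketch comparing the two on the array printed above:

```python
import numpy as np

# The 20 values printed above (seed 0)
arr = np.array([54, 57, 74, 77, 77, 19, 93, 31, 46, 97,
                80, 98, 98, 22, 68, 75, 49, 97, 56, 98])

pop_std = arr.std()           # ddof=0: divide by n
sample_std = arr.std(ddof=1)  # ddof=1: divide by n - 1

n = arr.size
# The two differ only by the factor sqrt(n / (n - 1))
assert np.isclose(sample_std, pop_std * np.sqrt(n / (n - 1)))

print(f"Population std: {pop_std:.2f}")
print(f"Sample std:     {sample_std:.2f}")
```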


### b. Operations on 2D Arrays (6 marks)
Simulate a 2D array representing students’ scores in 5 subjects (10 students).

Calculate the average score per student.
Determine the highest and lowest score in the dataset.

In [30]:
# Simulate 10 students x 5 subjects (scores 0-100)
scores = numpy.random.randint(0, 101, size=(10, 5))

# Average score per student
student_avg = scores.mean(axis=1).round(2)

# Highest and lowest score in the dataset
highest_score = scores.max()
lowest_score = scores.min()

# Present results as a DataFrame
scores_df = pd.DataFrame(scores, columns=[f"Subject_{i+1}" for i in range(scores.shape[1])])
scores_df['Average'] = student_avg

# Display
scores_df, highest_score, lowest_score

(   Subject_1  Subject_2  Subject_3  Subject_4  Subject_5  Average
 0         81         37         25         77         72     58.4
 1          9         20         80         69         79     51.4
 2         47         64         82         99         88     76.0
 3         49         29         19         19         14     26.0
 4         39         32         65          9         57     40.4
 5         32         31         74         23         35     39.0
 6         75         55         28         34          0     38.4
 7          0         36         53          5         38     26.4
 8         17         79          4         42         58     40.0
 9         31          1         65         41         57     39.0,
 np.int32(99),
 np.int32(0))
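`scores.max()` gives the value but not where it occurs. If the student and subject are also of interest, `argmax` combined with `unravel_index` recovers the position. A sketch on a small hypothetical matrix (not the simulated scores above):

```python
import numpy as np

# Hypothetical 3 students x 3 subjects
scores = np.array([[81, 37, 25],
                   [9, 20, 80],
                   [47, 64, 99]])

# argmax works on the flattened array; unravel_index maps it back to (row, col)
row, col = np.unravel_index(scores.argmax(), scores.shape)
row, col = int(row), int(col)

print(f"Highest score {scores[row, col]} at student {row}, subject {col}")
```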

### c. Working with 3D Arrays (6 marks)
Create a 3D NumPy array with dimensions (3, 4, 2) filled with random integers between 1 and 20. Perform the following:

Find the sum of elements across the second axis.
Compute the maximum value along each layer.
Flatten the entire 3D array into a 1D array.

In [31]:
# Create a 3D NumPy array (3, 4, 2) with random integers between 1 and 20
numpy.random.seed(1)  # for reproducibility
three_d = numpy.random.randint(1, 21, size=(3, 4, 2))

# 1) Sum of elements across the second axis (axis=1)
sum_across_second_axis = three_d.sum(axis=1)  # result shape: (3, 2)

# 2) Maximum value along each 2D layer (each slice along axis=0)
max_per_layer = three_d.max(axis=(1, 2))  # result shape: (3,)

# 3) Flatten the entire 3D array into a 1D array
flattened = three_d.ravel()

# Display results
print("3D array (shape {}):\n{}".format(three_d.shape, three_d))
print("\nSum across the second axis (axis=1) (shape {}):\n{}".format(sum_across_second_axis.shape, sum_across_second_axis))
print("\nMaximum value in each 2D layer (shape {}):\n{}".format(max_per_layer.shape, max_per_layer))
print("\nFlattened array (length {}):\n{}".format(flattened.size, flattened))

3D array (shape (3, 4, 2)):
[[[ 6 12]
  [13  9]
  [10 12]
  [ 6 16]]

 [[ 1 17]
  [ 2 13]
  [ 8 14]
  [ 7 19]]

 [[ 6 19]
  [12 11]
  [15 19]
  [ 5 10]]]

Sum across the second axis (axis=1) (shape (3, 2)):
[[35 49]
 [18 63]
 [38 59]]

Maximum value in each 2D layer (shape (3,)):
[16 19 19]

Flattened array (length 24):
[ 6 12 13  9 10 12  6 16  1 17  2 13  8 14  7 19  6 19 12 11 15 19  5 10]
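The shapes above follow one rule: reducing over an axis removes that axis. The sketch below rebuilds the array from the flattened output, checks the reductions, and shows an equivalent way to get the per-layer maximum. It also illustrates that `ravel()` returns a view when it can, while `flatten()` always copies:

```python
import numpy as np

# Rebuild the array from the flattened output above
flat = np.array([6, 12, 13, 9, 10, 12, 6, 16, 1, 17, 2, 13,
                 8, 14, 7, 19, 6, 19, 12, 11, 15, 19, 5, 10])
three_d = flat.reshape(3, 4, 2)

# Reducing over axis 1 removes it from the shape: (3, 4, 2) -> (3, 2)
sums = three_d.sum(axis=1)

# Per-layer maximum, written two equivalent ways
max_a = three_d.max(axis=(1, 2))
max_b = three_d.reshape(3, -1).max(axis=1)

# ravel() shares memory with the source here; flatten() does not
is_view = np.shares_memory(three_d, three_d.ravel())
is_copy = not np.shares_memory(three_d, three_d.flatten())

print(sums.shape, max_a, is_view, is_copy)
```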


## Question3 - Statistics for statistical analysis
### Instructions
Read all questions below carefully.
You are required to use the Markdown section of your Jupyter Notebook (.ipynb) to compute and present your solutions for all questions.
Show all necessary code, workings, and provide brief text explanations for your answers where required.

### a. Measures of Center and Spread (6 marks)

Given the dataset of $CO_2$ emissions (in metric tons per capita) from five countries: [25.4, 30.2, 22.5, 28.1, 35.0]

(a) Compute the mean, median, and mode.
(b) Determine the range and standard deviation.
(c) Comment briefly on the spread of the data.

In [48]:
# Compute measures for CO2 emissions dataset
data = numpy.array([25.4, 30.2, 22.5, 28.1, 35.0])

mean_val = data.mean()
median_val = numpy.median(data)

# Determine mode(s) using pandas (handles ties); if all unique -> no mode
counts = pd.Series(data).value_counts()
if counts.max() == 1:
	modes = None
else:
	modes = counts[counts == counts.max()].index.tolist()

range_val = data.max() - data.min()
std_val = data.std()  # population std (ddof=0), consistent with earlier cells

# Display results
print(f"Mean: {mean_val:.2f}")
print(f"Median: {median_val:.2f}")
print(f"Mode: {modes if modes is not None else 'No unique mode (all values appear once)'}")
print(f"Range: {range_val:.2f}")
print(f"Standard Deviation: {std_val:.2f}")

# Brief comment on spread
print("\nComment: The values are moderately spread around the mean (std ≈ "
	  f"{std_val:.2f}) with a range of {range_val:.2f}; no repeated values, so no unique mode.")

Mean: 28.24
Median: 28.10
Mode: No unique mode (all values appear once)
Range: 12.50
Standard Deviation: 4.26

Comment: The values are moderately spread around the mean (std ≈ 4.26) with a range of 12.50; no repeated values, so no unique mode.
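To make "moderately spread" more concrete, the standard deviation can be expressed relative to the mean (the coefficient of variation). The sketch below uses only the standard library and also shows the sample standard deviation, which divides by n - 1 and may be preferred if the five countries are viewed as a sample:

```python
import statistics

co2 = [25.4, 30.2, 22.5, 28.1, 35.0]

mean = statistics.mean(co2)
pop_std = statistics.pstdev(co2)    # divides by n, matches data.std()
sample_std = statistics.stdev(co2)  # divides by n - 1
cv = pop_std / mean                 # coefficient of variation

# multimode returns every value when all counts tie at 1
modes = statistics.multimode(co2)

print(f"Population std: {pop_std:.2f}, sample std: {sample_std:.2f}")
print(f"Coefficient of variation: {cv:.1%}")
print("All values tie for mode:", len(modes) == len(co2))
```

A coefficient of variation of roughly 15% supports reading the spread as moderate relative to the mean.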


### b. Hypothesis Testing (6 marks)

Two samples of beef consumption (kg/person/year) are given:

Argentina: [60, 62, 58, 63, 59]
Bangladesh: [15, 12, 18, 14, 16]
Perform a two-sample t-test at a 5% significance level ($\alpha=0.05$) to determine whether there is a significant difference in mean beef consumption between the two countries.

(a) State the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$) clearly.
(b) Compute the t-statistic and the p-value using your notebook.
(c) State your conclusion based on the p-value.

In [47]:
# Two-sample t-test (Welch) for Argentina vs Bangladesh beef consumption
alpha = 0.05
argentina = numpy.array([60, 62, 58, 63, 59])
bangladesh = numpy.array([15, 12, 18, 14, 16])

# (a) Hypotheses
# H0: mu_Argentina = mu_Bangladesh
# H1: mu_Argentina != mu_Bangladesh

# Sample stats
n1, n2 = argentina.size, bangladesh.size
mean1, mean2 = argentina.mean(), bangladesh.mean()
s1, s2 = argentina.std(ddof=1), bangladesh.std(ddof=1)

# t-statistic
t_stat = (mean1 - mean2) / numpy.sqrt(s1**2 / n1 + s2**2 / n2)

# degrees of freedom (Welch-Satterthwaite); named dof so the sales DataFrame df is not overwritten
num = (s1**2 / n1 + s2**2 / n2) ** 2
den = (s1**4) / (n1**2 * (n1 - 1)) + (s2**4) / (n2**2 * (n2 - 1))
dof = num / den

# p-value (two-sided)
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Output results
print("H0: mu_Argentina = mu_Bangladesh")
print("H1: mu_Argentina != mu_Bangladesh\n")
print(f"t-statistic: {t_stat:.4f}")
print(f"degrees of freedom: {dof:.4f}")
print(f"p-value (two-sided): {p_value:.4e}\n")

if p_value < alpha:
	print(f"Conclusion: reject H0 (p < {alpha}) — significant difference in mean consumption.")
else:
	print(f"Conclusion: fail to reject H0 (p >= {alpha}) — no significant difference detected.")

H0: mu_Argentina = mu_Bangladesh
H1: mu_Argentina != mu_Bangladesh

t-statistic: 33.2889
degrees of freedom: 7.9549
p-value (two-sided): 7.9315e-10

Conclusion: reject H0 (p < 0.05) — significant difference in mean consumption.
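The manual calculation can be cross-checked against SciPy's built-in test. Passing `equal_var=False` to `scipy.stats.ttest_ind` selects Welch's t-test, which should reproduce the t-statistic and p-value above:

```python
import numpy as np
from scipy import stats

argentina = np.array([60, 62, 58, 63, 59])
bangladesh = np.array([15, 12, 18, 14, 16])

# equal_var=False selects Welch's t-test (no equal-variance assumption)
result = stats.ttest_ind(argentina, bangladesh, equal_var=False)

print(f"t-statistic: {result.statistic:.4f}")
print(f"p-value:     {result.pvalue:.4e}")
```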


### c. Correlation Analysis (6 marks)

Given the following data for consumption (x) and $CO_2$ emission (y):

| Consumption (x) | $CO_2$ Emission (y) |
|---|---|
| 10 | 30 |
| 15 | 45 |
| 20 | 50 |
| 25 | 70 |
| 30 | 85 |

(a) Compute the Pearson correlation coefficient (r) between x and y.
(b) Interpret the result (comment on the strength and direction of the relationship).
(c) Briefly explain what it means if $r \approx 0$.

In [35]:
# (a) Compute Pearson correlation coefficient for given x and y
x = numpy.array([10, 15, 20, 25, 30])
y = numpy.array([30, 45, 50, 70, 85])

r, p_value = stats.pearsonr(x, y)

print(f"Pearson r: {r:.4f}")
print(f"p-value: {p_value:.4e}\n")

# (b) Interpretation (brief)
print("Interpretation: r is positive and close to 1, indicating a strong positive linear relationship.")
print("As consumption (x) increases, CO2 emission (y) tends to increase.\n")

# (c) Meaning of r ≈ 0 (brief)
print("If r ≈ 0: there is little or no linear correlation between the variables.")
print("This does not rule out a non-linear relationship; it only indicates no linear association.")

Pearson r: 0.9872
p-value: 1.7314e-03

Interpretation: r is positive and close to 1, indicating a strong positive linear relationship.
As consumption (x) increases, CO2 emission (y) tends to increase.

If r ≈ 0: there is little or no linear correlation between the variables.
This does not rule out a non-linear relationship; it only indicates no linear association.
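For reference, the coefficient can also be computed directly from its definition, $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. The sketch below does that, and adds a made-up example (`x_sym`, `y_sym` are illustrative assumptions) where y depends entirely on x yet r is zero:

```python
import numpy as np

x = np.array([10, 15, 20, 25, 30])
y = np.array([30, 45, 50, 70, 85])

# Deviations from the means
dx = x - x.mean()
dy = y - y.mean()

# Pearson r from the definition
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
print(f"Manual Pearson r: {r_manual:.4f}")

# Hypothetical non-linear case: y = x^2 on a symmetric range
x_sym = np.array([-2, -1, 0, 1, 2])
y_sym = x_sym ** 2
r_sym = np.corrcoef(x_sym, y_sym)[0, 1]
print(f"r for y = x^2: {r_sym:.4f}")
```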


## Question4 - Linear Algebra
### Instructions
Background: We are analyzing the performance of 4 students in 3 subjects: Mathematics, English, and Science. The data is represented by the matrix $A$:

$$A = \begin{bmatrix} 80 & 70 & 90 \\ 60 & 85 & 75 \\ 95 & 88 & 92 \\ 70 & 60 & 65 \end{bmatrix}$$

Each row represents a student, and each column represents a subject (Math, English, Science).

### a. Total Scores per Student (6 marks)

Compute the total score for each student by summing the elements in each row. Present your result as a $4 \times 1$ column vector.

In [36]:
# Matrix A: each row is a student, columns are [Math, English, Science]
A = numpy.array([
	[80, 70, 90],
	[60, 85, 75],
	[95, 88, 92],
	[70, 60, 65]
])

# Total score per student as a 4x1 column vector
total_scores = A.sum(axis=1).reshape(-1, 1)

total_scores

array([[240],
       [220],
       [275],
       [195]])

### b. Average Score per Subject (6 marks)

Compute the average score for each subject by calculating the mean of each column of matrix $A$. Present your result as a $1 \times 3$ row vector representing the averages for Math, English, and Science.

In [37]:
# Compute average score per subject for matrix A (Math, English, Science)
subject_averages = A.mean(axis=0).reshape(1, -1)
subject_averages

array([[76.25, 75.75, 80.5 ]])

### c. Weighted Final Grades (6 marks)

The subjects have importance weights given by the vector $w = [0.5, 0.3, 0.2]$. Use matrix multiplication to compute each student's weighted final grade. The operation is $G = A w^T$. Show the resulting column vector $G$.

In [46]:
# weights vector for [Math, English, Science]
weight = numpy.array([0.5, 0.3, 0.2])

# Compute weighted final grades: G = A @ w^T and present as a 4x1 column vector
weighted_grade = (A @ weight).reshape(-1, 1)

weighted_grade

array([[79. ],
       [70.5],
       [92.3],
       [66. ]])
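The product $A w^T$ is a weighted sum across each row, so the first student's grade is $0.5 \cdot 80 + 0.3 \cdot 70 + 0.2 \cdot 90 = 79$. A sketch that makes the column-vector shape explicit and confirms the equivalence with an element-wise weighted sum (the names `w_col`, `G`, and `G_alt` are illustrative):

```python
import numpy as np

A = np.array([[80, 70, 90],
              [60, 85, 75],
              [95, 88, 92],
              [70, 60, 65]])
w = np.array([0.5, 0.3, 0.2])

# Weights should sum to 1 so grades stay on the original 0-100 scale
assert np.isclose(w.sum(), 1.0)

# Explicit column vector: (4, 3) @ (3, 1) -> (4, 1)
w_col = w.reshape(-1, 1)
G = A @ w_col

# Same result as weighting each column and summing across subjects
G_alt = (A * w).sum(axis=1, keepdims=True)
assert np.allclose(G, G_alt)

print(G)
```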

### d. Applying Subject Importance (6 marks)

Suppose Mathematics is considered twice as important as English and Science.

(a) Create a new matrix $A'$ by performing a scalar multiplication on the Math column of $A$ (multiply the first column by 2).
(b) Recompute the total score for each student using this new matrix $A'$.
(c) Compare the new totals to those from part a (Total Scores per Student) and briefly discuss the changes.

In [39]:
# (a) Create A' by doubling the Math (first) column
A_prime = A.copy()
A_prime[:, 0] = A_prime[:, 0] * 2

# (b) Recompute total scores for each student as a 4x1 column vector
total_scores_prime = A_prime.sum(axis=1).reshape(-1, 1)

# (c) Compare new totals to the originals and show percent change
comparison = pd.DataFrame({
	'Original_Total': total_scores.flatten(),
	'New_Total': total_scores_prime.flatten(),
	'Difference': (total_scores_prime - total_scores).flatten()
})
comparison['Pct_Change'] = (comparison['Difference'] / comparison['Original_Total'] * 100).round(2)

# Display results
A_prime, total_scores_prime, comparison

(array([[160,  70,  90],
        [120,  85,  75],
        [190,  88,  92],
        [140,  60,  65]]),
 array([[320],
        [280],
        [370],
        [265]]),
    Original_Total  New_Total  Difference  Pct_Change
 0             240        320          80       33.33
 1             220        280          60       27.27
 2             275        370          95       34.55
 3             195        265          70       35.90)
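As a brief discussion of the comparison: doubling the Math column adds each student's Math score once more, so the Difference column equals the original Math scores. The sketch below checks this, using broadcasting as an alternative way to build $A'$ (the `scale` vector is an illustrative assumption):

```python
import numpy as np

A = np.array([[80, 70, 90],
              [60, 85, 75],
              [95, 88, 92],
              [70, 60, 65]])

# Scale factors per subject: Math x2, English and Science unchanged
scale = np.array([2, 1, 1])
A_prime = A * scale  # broadcasts across rows

old_totals = A.sum(axis=1)
new_totals = A_prime.sum(axis=1)

# The increase is exactly the original Math score
assert np.array_equal(new_totals - old_totals, A[:, 0])

print(new_totals)
```

Students whose Math score makes up a larger share of their original total gain the most in percentage terms, which is why the fourth student (35.90%) rises more than the second (27.27%).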