## Student Id:

# Part 2

### Task 6 - Explore the dataset to identify an "interesting" pattern or trend

In [1]:
import pandas as pd

def merge_datasets(part_2a, part_2b):
    # Inner join on id_student: keeps only students present in both datasets
    merged = pd.merge(part_2a, part_2b, on="id_student", how="inner")
    return merged

def analyze_pattern(merged_data):
    # Mean click_events for each final_result category
    trends = merged_data.groupby("final_result")["click_events"].mean()
    # Pearson correlation between click_events and score (pandas' default method)
    correlation = merged_data["click_events"].corr(merged_data["score"])
    return trends, correlation

# Load datasets (assuming they are provided as CSV files)
part_2a = pd.read_csv("part_2a.csv")
part_2b = pd.read_csv("part_2b.csv")

# Merge and analyze
merged_data = merge_datasets(part_2a, part_2b)
trends, correlation = analyze_pattern(merged_data)

print("Average Click Events by Final Result:")
print(trends)
print(f"\nCorrelation between Click Events and Score: {correlation}")


Average Click Events by Final Result:
final_result
Distinction    3114.203402
Fail            953.793794
Pass           2168.343137
Withdrawn      1119.588530
Name: click_events, dtype: float64

Correlation between Click Events and Score: 0.2751314422551821
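
The Pearson coefficient above measures only linear association and can be pulled around by a few very heavy clickers. As a robustness check, the sketch below also computes a Spearman rank correlation, obtained here as the Pearson correlation of the ranks. The helper name `compare_correlations` and the toy values are hypothetical and are not taken from `part_2a.csv` or `part_2b.csv`; the column names are assumed to match `merged_data`.

```python
import pandas as pd

def compare_correlations(df, x="click_events", y="score"):
    """Return (pearson, spearman) correlations between two columns."""
    pearson = df[x].corr(df[y])
    # Spearman's rho is the Pearson correlation of the ranked values
    spearman = df[x].rank().corr(df[y].rank())
    return pearson, spearman

# Hypothetical toy data: scores rise with clicks, but one student clicks far more
toy = pd.DataFrame({
    "click_events": [1, 2, 3, 4, 100],
    "score": [1, 2, 3, 4, 5],
})
pearson, spearman = compare_correlations(toy)
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the real data this could be called as `compare_correlations(merged_data)`. A Spearman value noticeably higher than the Pearson value would suggest a monotonic but non-linear relationship.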


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Task 7 - Detect and remove any outliers in the data used for your "interesting" pattern or trend

In [2]:
def remove_outliers(data, column):
    # Tukey's IQR rule: keep values within 1.5 * IQR of the quartiles
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Keep only rows inside the bounds (rows with a missing value in column are also dropped)
    filtered_data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    return filtered_data

# Remove outliers in click_events first, then in score
# (the score bounds are computed on the rows that survive the click_events filter)
cleaned_data = remove_outliers(merged_data, "click_events")
cleaned_data = remove_outliers(cleaned_data, "score")

print(f"Data after removing outliers: {cleaned_data.shape}")


Data after removing outliers: (23165, 10)
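
The shape alone does not show how many rows each filter removed or where the cut-offs fall. The sketch below reports the IQR bounds and the number of flagged rows per column. The helper names `iqr_bounds` and `outlier_report` and the toy values are hypothetical. Unlike the sequential filtering above, it computes every column's bounds on the same input frame.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_report(data, columns):
    """Summarise bounds and flagged row counts for each column."""
    rows = []
    for column in columns:
        lower, upper = iqr_bounds(data[column])
        # between() is inclusive; missing values fall outside and count as flagged
        flagged = int((~data[column].between(lower, upper)).sum())
        rows.append({"column": column, "lower": lower,
                     "upper": upper, "flagged": flagged})
    return pd.DataFrame(rows).set_index("column")

# Hypothetical toy data with one extreme click count
toy = pd.DataFrame({
    "click_events": [10, 11, 11, 12, 12, 13, 500],
    "score": [60, 62, 65, 67, 70, 72, 75],
})
report = outlier_report(toy, ["click_events", "score"])
print(report)
```

On the real data, `outlier_report(merged_data, ["click_events", "score"])` would give a quick view of how aggressive the 1.5 × IQR rule is for each column before any rows are dropped.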


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>

### Task 8 - Define a hypothesis to test your “interesting” pattern or trend and test your hypothesis with a statistical significance level of 0.05

In [3]:
from scipy.stats import linregress

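def test_hypothesis(data):
    # H_0: the slope of score regressed on click_events is zero (no linear relationship)
    # H_1: the slope is non-zero
    # linregress returns a two-sided p-value for the test that the slope is zero
    slope, intercept, r_value, p_value, std_err = linregress(data["click_events"], data["score"])
    return p_value

p_value = test_hypothesis(cleaned_data)

# Compare the p-value against the 0.05 significance level
if p_value < 0.05:
    print("Reject the Null Hypothesis (H_0): click events have a statistically significant linear association with student scores.")
else:
    print("Fail to Reject the Null Hypothesis (H_0): no significant linear association between click events and scores.")


Reject the Null Hypothesis (H_0): click events have a statistically significant linear association with student scores.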
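
With tens of thousands of rows, even a weak relationship can produce a p-value below 0.05, so the size of the effect is worth reporting alongside the test decision. The sketch below collects the slope, R² and p-value from the `linregress` result. The helper name `summarize_regression` and the toy values are hypothetical and are not results from `cleaned_data`.

```python
from scipy.stats import linregress

def summarize_regression(x, y, alpha=0.05):
    """Return slope, R^2, p-value and the test decision for y regressed on x."""
    result = linregress(x, y)
    return {
        "slope": result.slope,
        "r_squared": result.rvalue ** 2,
        "p_value": result.pvalue,
        # True when the two-sided slope test rejects H_0 at level alpha
        "reject_h0": bool(result.pvalue < alpha),
    }

# Hypothetical toy data: y is roughly 2 * x + 1 with a small alternating offset
x = list(range(10))
y = [2 * v + 1 + (0.5 if v % 2 else -0.5) for v in x]
summary = summarize_regression(x, y)
print(summary)
```

On the real data this could be called as `summarize_regression(cleaned_data["click_events"], cleaned_data["score"])`. A low R² together with a small p-value would indicate a relationship that is statistically detectable but explains little of the variation in scores.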


<p style="color:red; font-weight:bold; font-size:xx-small">MARK: __%</p>
<p style="color:red; font-weight:bold; font-size:xx-small">FEEDBACK: </p>