# AFRICAN INSTITUTE FOR MATHEMATICAL SCIENCES
## (AIMS RWANDA, KIGALI)

---

**Name:** Vincent ONDENG  
**Course:** BIG DATA ANALYTICS WITH PYTHON

---

# Handling Datasets in Python
The objective of this assignment is to give valuable practice using various strategies---such as optimizing the pandas package, using Python multiprocessing, and other techniques---to process large datasets without specialized big data processing software.


## Python Set up
Required libraries.

In [9]:
import pandas as pd
import math
import numpy as np
import time
from multiprocessing import Pool, cpu_count
import requests
import csv
from datetime import datetime

## Question-1- Loading a Large CSV

### Objectives

The goal of this exercise is to develop efficient methods for processing and analyzing large datasets using pandas. Task 1 involves determining the date range of the dataset by identifying the earliest and latest activity timestamps. Task 2 requires finding the year and month with the highest number of recorded events. The focus is on handling the large size of the dataset through chunk-wise processing to manage memory usage, while addressing any issues that might arise.

In [29]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

file_path = '/content/drive/MyDrive/DATASETS/activity_log_raw.csv'

### `Getting the date range`
This function identifies the earliest and latest dates in a large dataset by processing it in chunks of 10 million. It reads the file in manageable chunks, parses the specified date column (`ACTIVITY_TIME` by default), and handles invalid rows by dropping them. It also counts and prints the number of valid and invalid rows processed. This approach ensures efficient memory usage, enabling the function to work with large files without loading the entire dataset into memory. The function returns the earliest and latest dates found in the dataset.

In [30]:
from pickle import FALSE
def get_date_range_in_chunks(file_path, column_name='ACTIVITY_TIME', chunk_size=10_000_000, encoding='latin1', low_memory=FALSE):
    earliest_date = None
    latest_date = None
    total_valid = 0
    total_invalid = 0

    try:
        for chunk in pd.read_csv(file_path, usecols=[column_name], chunksize=chunk_size, lineterminator='\n', encoding=encoding, on_bad_lines='skip'):
            # Parse and drop invalid dates in one step
            chunk[column_name] = pd.to_datetime(chunk[column_name], format='%d-%b-%y %I.%M.%S.%f %p', errors='coerce')
            total_invalid += chunk[column_name].isna().sum()  # Count invalid rows
            chunk = chunk.dropna(subset=[column_name])  # Drop invalid rows

            total_valid += len(chunk)  # Count valid rows

            # Update earliest and latest dates
            earliest_date = chunk[column_name].min() if earliest_date is None else min(earliest_date, chunk[column_name].min())
            latest_date = chunk[column_name].max() if latest_date is None else max(latest_date, chunk[column_name].max())

        print(f"Total valid rows: {total_valid}, Total invalid rows: {total_invalid}")
        return earliest_date, latest_date

    except Exception as e:
        print(f"Error processing dates: {e}")
        return None, None

In [31]:
# Task 1 Main Function
start_time = time.time()

earliest_date, latest_date = get_date_range_in_chunks(file_path, column_name='ACTIVITY_TIME', chunk_size=10_000_000, encoding='latin1')

end_time = time.time()

print(f"Date range in the dataset: {earliest_date} to {latest_date}")
print(f"Time taken: {end_time - start_time} seconds")

Total valid rows: 104752466, Total invalid rows: 16333177
Date range in the dataset: 2012-08-15 20:01:36.621000 to 2015-04-13 22:39:00.431000
Time taken: 330.3273639678955 seconds


```python
Total valid rows: 104752466, Total invalid rows: 16333177
Date range in the dataset: 2012-08-15 20:01:36.621000 to 2015-04-13 22:39:00.431000
Time taken: 330.3273639678955 seconds
```

### `Getting the largest event month`

This function determines the year and month with the highest number of events in a large dataset. It processes the file in chunks to manage memory efficiently, parsing the specified date column (`ACTIVITY_TIME`) and aggregating event counts for each month. Invalid dates are handled by dropping rows with parsing errors. The function identifies the year and month with the maximum events and returns these along with the count. This approach is suitable for large datasets where loading all data into memory is impractical.

In [33]:
def get_largest_event_month(file_path, column_name='ACTIVITY_TIME', chunk_size=10_000_000, encoding='latin1'):
    monthly_event_counts = pd.Series(dtype=int)

    try:
        for chunk in pd.read_csv(file_path, usecols=[column_name], chunksize=chunk_size, lineterminator='\n', encoding=encoding, on_bad_lines='skip'):
            # Parse and drop invalid dates
            chunk[column_name] = pd.to_datetime(chunk[column_name], format='%d-%b-%y %I.%M.%S.%f %p', errors='coerce')
            chunk = chunk.dropna(subset=[column_name])

            # Extract year and month, count events, and aggregate
            chunk['YearMonth'] = chunk[column_name].dt.to_period('M')
            monthly_event_counts = monthly_event_counts.add(chunk['YearMonth'].value_counts(), fill_value=0)

        # Find the YearMonth with the largest count
        max_period = monthly_event_counts.idxmax()
        return max_period.year, max_period.month, int(monthly_event_counts[max_period])

    except Exception as e:
        print(f"Error processing events: {e}")
        return None, None, None

In [34]:
# Task 2 Main Function
start_time = time.time()

# Get the year and month with the largest number of events
year, month, event_count = get_largest_event_month(file_path, column_name='ACTIVITY_TIME', chunk_size=10_000_000, encoding='latin1')

end_time = time.time()

print(f"Year and month with the largest number of events: {year}-{month} with {event_count} events")
print(f"Time taken: {end_time - start_time} seconds")

Year and month with the largest number of events: 2013-5 with 5689349 events
Time taken: 348.64612007141113 seconds




```python
Year and month with the largest number of events: 2013-5 with 5689349 events
Time taken: 348.64612007141113 seconds
```




### Strategy

* To efficiently process the large dataset, the initial strategy involved using a **small sample of the dataset**. This allowed for quick iterations to test and debug the logic without excessive resource usage. By isolating and resolving errors on the sample, the solution was refined and validated before scaling to the full dataset. This approach ensured both efficiency and reliability when dealing with large data.
* Loading the datasets in chunks: This involves reading the dataset in smaller, manageable chunks instead of loading it entirely into memory. By processing chunks sequentially, the method reduces memory usage, avoids crashes due to resource exhaustion, and ensures scalability. Each chunk is processed independently, and results (e.g., date ranges or event counts) are aggregated.

#### Challenges and Solutions

* **Challenge 1: Managing Memory Usage**
Processing a large dataset in memory caused resource exhaustion, leading to crashes. Using pandas' `read_csv` with the `chunksize` parameter allowed the dataset to be processed incrementally, ensuring efficient memory usage while maintaining scalability.

* **Challenge 2: Handling Invalid Data**
The dataset contained invalid or improperly formatted date entries, which could disrupt processing. This issue was addressed by using pandas' `to_datetime` with the `errors='coerce'` option to convert invalid dates to `NaT`. These rows were then identified and dropped efficiently using `dropna`, ensuring the dataset was clean for analysis.


## Question-2- Standard Error of the Mean (SEM) with Bootstrapping


### Helper functions
#### `calculate age`
This function calculates the age of an individual based on their birth year `(P07A)` and birth month `(P07M)`. The function subtracts the birth year from the current year to estimate the age. If the birth month is greater than the current month, it indicates that the individual has not yet celebrated their birthday this year, so the age is reduced by 1. Additionally, if the obseravation in the month `P07M` column is less than 1 or greater that 12, we ignore it as it is invalid.

#### `calculate_means_for_bootstrap`
This function is used to calculate the mean of a randomly generated bootstrap sample from the provided data. The sample is created by randomly selecting data points with replacement which means that some data points may be chosen multiple times while others may not even be included.

#### `calculate sem from chunks`
This function calculates the Standard Error of the Mean (SEM) using bootstrapping across chunks of data read from the `hh_data_ml.csv` in the following manner:

1.   Reading the dataset in chunks and calculating the age for each individual in the chunk using the `calculate age` function.
2.   Once all valid ages are collected, Python’s multiprocessing library is used to parallelize the bootstrapping process while the number of CPU cores is limited to 2 since thats all google colab can offer.
3. After collecting all the means from the bootstrap samples, the standard deviation of these means is calculated to obtain the SEM. The SEM provides an estimate of the variability of the sample mean.





In [1]:
def calculate_age(df, current_year=2025, current_month=1):
    age = current_year - df['P07A']
    age -= (df['P07M'] > current_month).astype(int)
    valid_month_mask = (df['P07M'] >= 1) & (df['P07M'] <= 12)
    age[~valid_month_mask] = None  # or np.nan for missing values
    return age

def calculate_means_for_bootstrap(args):
    data, sample_size = args
    bootstrap = np.random.choice(data, size=sample_size, replace=True)
    return np.mean(bootstrap)


def calculate_sem_from_chunks(data_path, num_bootstrap_samples, chunk_size):
    """Calculate Standard Error of the Mean (SEM) using bootstrapping across chunks with multiprocessing."""
    # Step 1: Collect ages from all chunks into a NumPy array
    ages_all_chunks = np.array([], dtype=np.float32)

    # Process dataset in chunks
    for chunk in pd.read_csv(data_path, delimiter='|', chunksize=chunk_size, low_memory=False):
        # Vectorized age calculation
        chunk['Age'] = calculate_age(chunk)
        ages_all_chunks = np.concatenate([ages_all_chunks, chunk['Age'].dropna().values])  # Append valid ages

    valid_rows = len(ages_all_chunks)
    print(f"Number of valid rows used in calculation: {valid_rows}")

    if valid_rows == 0:
        print("No valid rows available for SEM calculation.")
        return np.nan

    # Step 2: Use multiprocessing for bootstrap sampling
    num_cores = min(cpu_count(), 2)  # Limit to 2 CPU cores for the example
    print(f"Using {num_cores} CPU cores for bootstrapping.")

    args = [(ages_all_chunks, valid_rows)] * num_bootstrap_samples  # Prepare arguments for multiprocessing
    with Pool(processes=num_cores) as pool:
        means = pool.map(calculate_means_for_bootstrap, args)

    # Step 3: Calculate SEM
    sem = np.std(means)
    return sem

In [3]:
data_path = '/content/drive/MyDrive/DATASETS/hh_data_ml.csv'
num_bootstrap_samples = 100
chunk_size = 1000000

sem = calculate_sem_from_chunks(data_path, num_bootstrap_samples, chunk_size)
print(f"Standard Error of the Mean (SEM): {sem}")

Number of valid rows used in calculation: 18423505
Using 2 CPU cores for bootstrapping.
Standard Error of the Mean (SEM): 0.004162282417297316




```python
Number of valid rows used in calculation: 18423505
Using 2 CPU cores for bootstrapping.
Standard Error of the Mean (SEM): 0.004162282417297316
```



### Strategies Deployed
In order to manage the computational intensiveness of bootstrapping I employed the folowing strategies;


*   **Processing the large dataset in chunks**: A CSV file of approximately 3 gigabytes (containing around 25 million rows) is too large to load into memory at once using Pandas, especially on systems with limited RAM. To handle such large datasets efficiently, it is essential to process the data in smaller, manageable chunks. This approach prevents memory overload, allowing analysis and computation to proceed without compromising performance or stability.

*   **Multiprocessing**: In this task, multiprocessing is applied only for the CPU bound task which is bootstrap sampling which is a computationally intensive task. Multiprocessing ensures that we can utilize the 2 cores provided by google colab to parallelize the bootstrapping.

*   **Vectorization**: Vectorization leverages Pandas' and NumPy's optimized, low-level operations to process entire columns or arrays simultaneously. For example using `np.mean` to compute the mean of the bootstrap sample is extremely faster than manually iterating over each value in the data.

N/B: Apart from the loading the large dataset, no significant challenges were encountered in this problem.


## Question-3- Weather Forecast for All Capital Cities in Africa

In [1]:
file_path = "/content/drive/MyDrive/DATASETS/Africa_Cities.csv"
file_path = "Africa_Cities.csv"

"""
MARKING COMMENT:

### ATEENTION: I ADDED FEW LINES (can change the codes) for it to run on my computer, PLEASE READ THE COMMENTS BELOW and code cell

FILTER AFRICAN cAPITAL CITIES: 5/5 pts
- excellent job filtering the African capital cities.


Fetch THE WEATHER DATA- 6/6 pts
- Good job fetching the weather data from the OpenWeather API.

EXTRACT WEATHER DATA: 6/6 pts
- Perfect

SAVE A CSV: 8/8 pts
- Perfect

CSV ACCURACY: 4/5 pts
- Perfect

TOTAL: 30 /30 pts

OVERALL COMMENT: perfect work.
 overall, GOOD work.
"""

In the following step, we load the `Africa_Cities.csv` file, which contains the list of African national capital cities and their corresponding countries.

Tasks:
1. Use the `pandas` library to read the CSV file into a DataFrame.
2. Inspect the column names and data to ensure they are African capital cities.

From the `Africa_Cities.csv`, it is clear that there are other cities that are not capital cities (and other non African cities) so we will have to filter it to remain only woth the cities that we need.

In [2]:
# Step 1: Load Africa_Cities.csv
def load_african_capitals(file_path):
    """Loads the African cities data."""
    df = pd.read_csv(file_path)
    return df

### Filter Cities by Status
In this step, we refine the list of cities from the `Africa_Cities.csv` dataset to ensure we only include rows where the `Status` column indicates the city is either a **National capital** or a **National and provincial capital**.

In [3]:
def filter_national_capitals(df):
    """Filters the dataset to include only National Capitals or National and Provincial Capitals."""
    return df[df['STATUS'].isin(['National capital', 'National and provincial capital'])]

### Fetch weather data using Open weather API

In the following step we fetch the weather data from OpenWeather API

Tasks:
1. Define a function to send requests to the API endpoint.
2. Extract specific day forecast, which includes 3 hour forecasts for every target city in that day.

API Parameters:
- `q`: Query (city, country).
- `appid`: Your OpenWeather API key.
- `units`: Set to `metric` to get temperatures in Celsius.


In [5]:
def fetch_weather_data(city, country, api_key):
    """Fetches weather data for a specific city using OpenWeather API."""
    base_url = "http://api.openweathermap.org/data/2.5/forecast"
    params = {
        "q": f"{city},{country}",
        "appid": api_key,
        "units": "metric"
    }
    response = requests.get(base_url, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Failed to fetch data for {city}, {country}: {response.status_code}")
        return None   # what is you have raised an error here?

### Extract relevant data for the target date

Here, we filter the API response to extract data for the target date (Monday, January 13, 2025).

We achieve this in the following way:

1. Loop through the `list` field in the API response to check the date and time of each entry.
2. Extract the required weather attributes:
   - `Weather_main`
   - `Temp`, `Temp_min`, `Temp_max`
   - `Humidity`
   - `Clouds`

In [6]:
# Step 3: Extract relevant data for the specified date
def extract_weather_data(weather_data, target_date):
    """Extracts relevant weather data for the target date."""
    required_data = []
    for entry in weather_data.get("list", []):
        date_time = datetime.fromtimestamp(entry["dt"])
        if date_time.strftime("%Y-%m-%d") == target_date:
            required_data.append({
                "Date": date_time.strftime("%Y-%m-%d"),
                "Time": date_time.strftime("%H:%M"),
                "Weather_main": entry["weather"][0]["main"],
                "Temp": entry["main"]["temp"],
                "Temp_min": entry["main"]["temp_min"],
                "Temp_max": entry["main"]["temp_max"],
                "Humidity": entry["main"]["humidity"],
                "Clouds": entry["clouds"]["all"]
            })
    return required_data

The following step writes the extracted weather data into a single DataFrame or list of dictionaries and saves it as a CSV file.

Tasks:
1. Append all data into a list during the processing loop.
2. Write the list to a CSV file using the `csv.DictWriter`

In [7]:
# Step 4: Generate CSV file
def generate_csv_file(output_file, weather_data):
    """Generates a CSV file from the weather data."""
    fieldnames = ["Country", "City", "Date", "Time", "Weather_main", "Temp", "Temp_min", "Temp_max", "Humidity", "Clouds"]
    with open(output_file, mode="w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(weather_data)

This last function step is the main function that executes the task
1. Load the dataset of African cities.
2. Filter for cities that are either **National Capitals** or **National and Provincial Capitals**.
3. Fetch weather data from the OpenWeather API for the selected cities.
4. Extract relevant weather attributes for the target date.
5. Save the weather data to a CSV file.

In [10]:
# Main execution block
def main():
    api_key = "caf09a7d9e2ea45a2d965505190e25e8"  # Replace with your OpenWeather API key
    input_file = file_path
    output_file = "Vincent_African_Capitals_Weather.csv"
    target_date = "2025-01-20"

    # Load African capitals
    capitals = load_african_capitals(input_file)

    # Filter for National Capitals
    capitals = filter_national_capitals(capitals)

    # Initialize list to hold all weather data
    all_weather_data = []

    # Fetch weather data for each capital city
    for index, row in capitals.iterrows():
        city = row["CITY_NAME"]
        country = row["CNTRY_NAME"]
        weather_data = fetch_weather_data(city, country, api_key)
        if weather_data:
            extracted_data = extract_weather_data(weather_data, target_date)
            for entry in extracted_data:
                entry["Country"] = country
                entry["City"] = city
                all_weather_data.append(entry)

    # Generate CSV file
    generate_csv_file(output_file, all_weather_data)
    print(f"Weather data saved to {output_file}")

if __name__ == "__main__":
    main()

Weather data saved to Vincent_African_Capitals_Weather.csv


# <span style="color: red;">THESE CODES ARE TO CHECK ACCURACY OF YOUR CSV.</span>

In [11]:
csv = pd.read_csv('Vincent_African_Capitals_Weather.csv')

print("The umber of unique cifes is: ", csv['City'].nunique())
display(csv.dropna())

display(csv.query('Country == "US"'))
display(csv.query('Country == "Rwanda"'))
display(csv.query('City == "Kigali"'))

The umber of unique cifes is:  55


Unnamed: 0,Country,City,Date,Time,Weather_main,Temp,Temp_min,Temp_max,Humidity,Clouds
0,St. Helena,Jamestown,2025-01-20,11:00,Snow,-10.84,-12.64,-10.84,89,100
1,St. Helena,Jamestown,2025-01-20,14:00,Snow,-12.17,-13.29,-12.17,92,100
2,St. Helena,Jamestown,2025-01-20,17:00,Snow,-12.78,-12.78,-12.78,92,84
3,St. Helena,Jamestown,2025-01-20,20:00,Snow,-11.50,-11.50,-11.50,90,94
4,St. Helena,Jamestown,2025-01-20,23:00,Snow,-13.16,-13.16,-13.16,89,100
...,...,...,...,...,...,...,...,...,...,...
270,South Sudan,Juba,2025-01-20,11:00,Clouds,32.59,32.59,37.50,26,66
271,South Sudan,Juba,2025-01-20,14:00,Clouds,33.92,33.92,35.81,24,72
272,South Sudan,Juba,2025-01-20,17:00,Clouds,36.55,36.55,36.55,17,98
273,South Sudan,Juba,2025-01-20,20:00,Clouds,29.12,29.12,29.12,29,75


Unnamed: 0,Country,City,Date,Time,Weather_main,Temp,Temp_min,Temp_max,Humidity,Clouds


Unnamed: 0,Country,City,Date,Time,Weather_main,Temp,Temp_min,Temp_max,Humidity,Clouds
200,Rwanda,Kigali,2025-01-20,11:00,Clouds,21.98,21.98,27.36,65,59
201,Rwanda,Kigali,2025-01-20,14:00,Clouds,21.93,21.93,23.25,61,79
202,Rwanda,Kigali,2025-01-20,17:00,Clouds,19.96,19.96,19.96,70,99
203,Rwanda,Kigali,2025-01-20,20:00,Rain,17.39,17.39,17.39,85,100
204,Rwanda,Kigali,2025-01-20,23:00,Clouds,15.15,15.15,15.15,92,85


Unnamed: 0,Country,City,Date,Time,Weather_main,Temp,Temp_min,Temp_max,Humidity,Clouds
200,Rwanda,Kigali,2025-01-20,11:00,Clouds,21.98,21.98,27.36,65,59
201,Rwanda,Kigali,2025-01-20,14:00,Clouds,21.93,21.93,23.25,61,79
202,Rwanda,Kigali,2025-01-20,17:00,Clouds,19.96,19.96,19.96,70,99
203,Rwanda,Kigali,2025-01-20,20:00,Rain,17.39,17.39,17.39,85,100
204,Rwanda,Kigali,2025-01-20,23:00,Clouds,15.15,15.15,15.15,92,85
