### Project Overview: Analyzing Climate Change in Africa

In this project, we'll be working with the **'Climate Change in Africa' dataset**, provided by the U.S. Global Change Research Program. This dataset contains valuable historical data on daily minimum, maximum, and average temperature fluctuations across five African countries: **Egypt, Tunisia, Cameroon, Senegal,** and **Angola**, spanning from **1980 to 2023**.

📊 **Dataset Description**: The data offers insights into temperature trends and patterns across the selected countries, presenting an opportunity to explore and visualize climate variations over the years.

➡️ [**Dataset Link**](https://drive.google.com/file/d/1I8eV4-8p61CNNlVJzzho2xeoZ5-P7Q0F/view)

---

### Instructions

1. **Load the Dataset**  
   Begin by importing the dataset into a DataFrame using Python.

2. **Data Cleaning**  
   Perform necessary data cleaning to ensure accuracy and consistency in your analysis.

3. **Line Chart Visualization**  
   Create a line chart to display the average temperature fluctuations in **Tunisia** and **Cameroon**. Analyze and interpret the observed trends.

4. **Time Frame Focus (1980-2005)**  
   Zoom in on the data between **1980 and 2005**, and customize the axes labels for better clarity.

5. **Histograms of Temperature Distribution**  
   Generate histograms showing the temperature distribution in **Senegal**, comparing the periods **1980-2000** and **2000-2023** within the same figure. Summarize the key insights.

6. **Country-Wise Temperature Visualization**  
   Choose the most appropriate chart type to represent the **average temperature per country**.

7. **Exploratory Analysis**  
   Formulate your own questions about the dataset and explore answers using relevant visuals.


In [3]:
import warnings
warnings.filterwarnings("ignore")

### Importing necessary libraries

In [5]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

### Load the dataset

In [11]:
# Load the dataset into a DataFrame
df = pd.read_csv('Africa_climate_change.csv')

# Display the first few rows of the dataset to confirm it has loaded correctly
df.head()

,DATE,PRCP,TAVG,TMAX,TMIN,COUNTRY
0,19800101 000000,,54.0,61.0,43.0,Tunisia
1,19800101 000000,,49.0,55.0,41.0,Tunisia
2,19800101 000000,0.0,72.0,86.0,59.0,Cameroon
3,19800101 000000,,50.0,55.0,43.0,Tunisia
4,19800101 000000,,75.0,91.0,,Cameroon


### EDA and Cleaning

In [13]:
# Brief description of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 464815 entries, 0 to 464814
Data columns (total 6 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   DATE     464815 non-null  object 
 1   PRCP     177575 non-null  float64
 2   TAVG     458439 non-null  float64
 3   TMAX     363901 non-null  float64
 4   TMIN     332757 non-null  float64
 5   COUNTRY  464815 non-null  object 
dtypes: float64(4), object(2)
memory usage: 21.3+ MB


In [5]:
# Summary statistics for all columns in df
df.describe(include = 'all').T

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
DATE,464815.0,15938.0,20150619 000000,37.0,,,,,,,
PRCP,177575.0,,,,0.120941,0.486208,0.0,0.0,0.0,0.01,19.69
TAVG,458439.0,,,,77.029838,11.523634,-49.0,70.0,80.0,85.0,110.0
TMAX,363901.0,,,,88.713969,13.042631,41.0,81.0,90.0,99.0,123.0
TMIN,332757.0,,,,65.548262,11.536547,12.0,58.0,68.0,74.0,97.0
COUNTRY,464815.0,5.0,Senegal,183262.0,,,,,,,


In [27]:
df[df["COUNTRY"] == 'Tunisia'].describe(include = 'all').T

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
DATE,79301.0,15932.0,19800101 000000,5.0,,,,,,,
PRCP,79301.0,,,,0.032066,0.187422,0.0,0.0,0.0,0.0,12.6
TAVG,79301.0,,,,67.736422,12.772479,-49.0,57.0,67.0,79.0,105.0
TMAX,62209.0,,,,78.513013,14.481664,41.0,66.0,78.0,90.0,123.0
TMIN,55257.0,,,,57.169426,12.017871,12.0,47.0,57.0,67.0,90.0
COUNTRY,79301.0,1.0,Tunisia,79301.0,,,,,,,


In [29]:
# Display 30 random rows
df.sample(n = 30)

,DATE,PRCP,TAVG,TMAX,TMIN,COUNTRY
34530,19831111 000000,0.0,89.0,105.0,75.0,Senegal
7379,19801020 000000,0.0,72.0,,63.0,Egypt
442548,20210807 000000,0.0,85.0,91.0,81.0,Egypt
125416,19930512 000000,0.0,83.0,98.0,73.0,Senegal
297378,20090407 000000,0.0,63.0,78.0,,Egypt
264998,20060505 000000,0.0,64.0,72.0,,Tunisia
394764,20170422 000000,0.0,83.0,102.0,64.0,Egypt
240873,20040406 000000,0.0,59.0,65.0,,Egypt
287770,20080618 000000,0.0,86.0,98.0,76.0,Tunisia
332712,20120224 000000,0.0,77.0,94.0,62.0,Egypt


In [31]:
# Convert DATE column to datetime format
df['DATE'] = pd.to_datetime(df['DATE'], format='%Y%m%d %H%M%S', errors = 'coerce')
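Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can be worth counting them right after the conversion. The following is a minimal sketch on a small made-up sample; the `sample` frame and its values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample: two well-formed dates and one malformed entry
sample = pd.DataFrame({'DATE': ['19800101 000000', '20231231 000000', 'not a date']})

# Same conversion as above; malformed strings become NaT instead of raising
sample['DATE'] = pd.to_datetime(sample['DATE'], format='%Y%m%d %H%M%S', errors='coerce')

# Count the rows that failed to parse
n_failed = sample['DATE'].isna().sum()
print(f"Unparseable dates: {n_failed}")
print(sample['DATE'].dtype)
```

Running the same `isna().sum()` check on `df['DATE']` would show whether any rows in the real dataset were coerced.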

In [None]:
df.head()

### Handling the missing values

In [33]:
df['COUNTRY'].unique()

array(['Tunisia', 'Cameroon', 'Senegal', 'Egypt', 'Angola'], dtype=object)

In [35]:
# Group by country and calculate summary statistics for TAVG
country_tavg_stats = df.groupby('COUNTRY')['TAVG'].agg(['mean', 'std'])

# Display the summary statistics
country_tavg_stats

COUNTRY,mean,std
Angola,75.930017,5.05393
Cameroon,80.315806,7.098227
Egypt,73.839069,12.512259
Senegal,82.964651,6.407171
Tunisia,67.736422,12.772479


In [None]:
# Group by country and calculate summary statistics for TMAX
country_tmax_stats = df.groupby('COUNTRY')['TMAX'].agg(['mean', 'std'])

# Display the summary statistics
country_tmax_stats

In [None]:
# Group by country and calculate summary statistics for TMIN
country_tmin_stats = df.groupby('COUNTRY')['TMIN'].agg(['mean', 'std'])

# Display the summary statistics
country_tmin_stats

- We can group the dataset by country and fill the missing values in each temperature column with that country's mean, since temperatures vary significantly by region.

###### Note that we could go further and group by month of the year within each country, but we stop here for now
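The month-level refinement mentioned in the note could look roughly like the sketch below. It uses a small made-up frame (`toy` and its values are assumptions) and writes to a hypothetical `TAVG_filled` column so the original values stay visible for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two countries, two months, one gap in each country
toy = pd.DataFrame({
    'DATE': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-07-01', '2000-07-02',
                            '2000-01-01', '2000-01-02']),
    'COUNTRY': ['Tunisia'] * 4 + ['Cameroon'] * 2,
    'TAVG': [50.0, np.nan, 80.0, 82.0, 75.0, np.nan],
})

# Group by country and calendar month so a missing January value is filled
# with that country's January mean rather than its all-year mean
group_keys = [toy['COUNTRY'], toy['DATE'].dt.month]
toy['TAVG_filled'] = toy['TAVG'].fillna(toy.groupby(group_keys)['TAVG'].transform('mean'))

print(toy)
```

A (country, month) group with no observed values at all would stay missing under this approach, so a second pass with the country-level mean may still be needed.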

In [None]:
# Fill missing temperature values with the mean of each country
df['TAVG'] = df.groupby('COUNTRY')['TAVG'].transform(lambda x: x.fillna(x.mean()))
df['TMAX'] = df.groupby('COUNTRY')['TMAX'].transform(lambda x: x.fillna(x.mean()))
df['TMIN'] = df.groupby('COUNTRY')['TMIN'].transform(lambda x: x.fillna(x.mean()))

# Check if the missing values are filled
print(df.isnull().sum())

##### Before replacing the missing values in PRCP, let's check whether it is related to the other columns

In [25]:
# To check the correlation between temperature and precipitation

correlation_tavg_prcp = df[['TAVG', 'PRCP']].corr().iloc[0, 1]
print(f"Correlation between TAVG and PRCP: {correlation_tavg_prcp}")

correlation_tmax_prcp = df[['TMAX', 'PRCP']].corr().iloc[0, 1]
print(f"Correlation between TMAX and PRCP: {correlation_tmax_prcp}")

correlation_tmin_prcp = df[['TMIN', 'PRCP']].corr().iloc[0, 1]
print(f"Correlation between TMIN and PRCP: {correlation_tmin_prcp}")

Correlation between TAVG and PRCP: -4.0758321644984994e-06
Correlation between TMAX and PRCP: -0.03421041454823679
Correlation between TMIN and PRCP: 0.059982205674322116


The results show only a very weak correlation. Let's check within each country.

##### Relationship with countries

In [15]:
# Group by country and calculate correlation between PRCP and TAVG
country_prcp_tavg_corrs = df.groupby('COUNTRY').apply(lambda x: x['PRCP'].corr(x['TAVG'])).reset_index(name='PRCP_TAVG_Corr')

# Group by country and calculate correlation between PRCP and TMIN
country_prcp_tmin_corrs = df.groupby('COUNTRY').apply(lambda x: x['PRCP'].corr(x['TMIN'])).reset_index(name='PRCP_TMIN_Corr')

# Group by country and calculate correlation between PRCP and TMAX
country_prcp_tmax_corrs = df.groupby('COUNTRY').apply(lambda x: x['PRCP'].corr(x['TMAX'])).reset_index(name='PRCP_TMAX_Corr')

# Merge the correlation results into a single DataFrame
country_corrs = pd.merge(country_prcp_tavg_corrs, country_prcp_tmin_corrs, on='COUNTRY')
country_corrs = pd.merge(country_corrs, country_prcp_tmax_corrs, on='COUNTRY')


country_corrs

,COUNTRY,PRCP_TAVG_Corr,PRCP_TMIN_Corr,PRCP_TMAX_Corr
0,Angola,-0.05748,-0.066272,0.061548
1,Cameroon,-0.103467,-0.00809,-0.168081
2,Egypt,-0.096758,-0.053969,-0.108276
3,Senegal,-0.107566,-0.029322,-0.143023
4,Tunisia,-0.097896,-0.045028,-0.120163


##### Still only weak correlations (|r| < 0.17 in every country)
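The per-country correlation table can also be built in a single pass instead of three `groupby` calls and two merges. This is a sketch under the assumption that the column names match those above; the `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned DataFrame
toy = pd.DataFrame({
    'COUNTRY': ['A'] * 4 + ['B'] * 4,
    'PRCP': [0.0, 0.1, 0.2, 0.3, 0.3, 0.2, 0.1, 0.0],
    'TAVG': [70.0, 71.0, 72.0, 73.0, 60.0, 61.0, 62.0, 63.0],
    'TMIN': [60.0, 62.0, 61.0, 63.0, 50.0, 52.0, 51.0, 53.0],
    'TMAX': [80.0, 81.0, 83.0, 82.0, 70.0, 71.0, 73.0, 72.0],
})

temp_cols = ['TAVG', 'TMIN', 'TMAX']

# For each country, correlate every temperature column with PRCP in one pass
country_corrs = (
    toy.groupby('COUNTRY')[temp_cols + ['PRCP']]
       .apply(lambda g: g[temp_cols].corrwith(g['PRCP']))
       .add_prefix('PRCP_')
       .add_suffix('_Corr')
)
print(country_corrs)
```

Note that these are Pearson coefficients, which only capture linear relationships; a weak value here does not rule out a non-linear or seasonal link between rainfall and temperature.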

In [19]:
# Group by country and calculate summary statistics for PRCP
country_prcp_stats = df.groupby('COUNTRY')['PRCP'].agg(['mean', 'median', 'min', 'max', 'std', 'count'])

# Display the summary statistics
country_prcp_stats

COUNTRY,mean,median,min,max,std,count
Angola,0.104949,0.0,0.0,15.35,0.475049,2924
Cameroon,0.412796,0.08,0.0,19.69,0.909221,15649
Egypt,0.018206,0.0,0.0,9.93,0.165231,47647
Senegal,0.243219,0.0,0.0,19.29,0.677917,46456
Tunisia,0.039182,0.0,0.0,12.6,0.206503,64899


We see that PRCP varies considerably across the countries, so a single global fill value would not be appropriate.

##### Replace missing values with the median precipitation for each country

In [21]:
# Calculate the median precipitation for each country
median_prcp_by_country = df.groupby('COUNTRY')['PRCP'].median()

# Fill missing values with the country-specific median (vectorized;
# equivalent to a row-wise apply, but much faster on ~465k rows)
df['PRCP'] = df['PRCP'].fillna(df['COUNTRY'].map(median_prcp_by_country))

# Verify the changes
df.head()

,DATE,PRCP,TAVG,TMAX,TMIN,COUNTRY
0,19800101 000000,0.0,54.0,61.0,43.0,Tunisia
1,19800101 000000,0.0,49.0,55.0,41.0,Tunisia
2,19800101 000000,0.0,72.0,86.0,59.0,Cameroon
3,19800101 000000,0.0,50.0,55.0,43.0,Tunisia
4,19800101 000000,0.08,75.0,91.0,,Cameroon


## Visualizations

### Create a line chart to display the average temperature fluctuations in Tunisia and Cameroon.

In [23]:
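# DATE must be datetime for the .dt accessor; convert here in case the
# conversion cell above was skipped or run out of order
if not pd.api.types.is_datetime64_any_dtype(df['DATE']):
    df['DATE'] = pd.to_datetime(df['DATE'], format='%Y%m%d %H%M%S', errors='coerce')

# Extract year, month, and day
df['Year'] = df['DATE'].dt.year
# Format the month names as "Jan", "Feb", etc.
df['Month'] = df['DATE'].dt.strftime('%b')
df['Day'] = df['DATE'].dt.day

df.head()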

In [None]:
# Filter the data for Tunisia and Cameroon
df_filtered = df[df['COUNTRY'].isin(['Tunisia', 'Cameroon'])]

# Group by 'Year' and 'COUNTRY', then calculate the average temperature
df_yearly = df_filtered.groupby(['Year', 'COUNTRY'])['TAVG'].mean().reset_index()

df_yearly

In [None]:
# Filter the data for Tunisia and Cameroon
df_filtered = df[df['COUNTRY'].isin(['Tunisia', 'Cameroon'])]

# Group by 'Year' and 'COUNTRY', then calculate the average temperature
df_yearly = df_filtered.groupby(['Year', 'COUNTRY'])['TAVG'].mean().reset_index()

# Create the line chart
fig = px.line(df_yearly, x='Year', y='TAVG', color='COUNTRY',
              title='Average Yearly Temperature Fluctuations in Tunisia and Cameroon',
              labels={'Year': 'Year', 'TAVG': 'Average Temperature (°F)'})

fig.update_layout(legend_title_text='Country')
fig.show()

#### By month

In [None]:


# Group by 'Month' and 'COUNTRY', then calculate the average temperature
df_monthly = df_filtered.groupby(['Month', 'COUNTRY'])['TAVG'].mean().reset_index()

# Create the line chart
fig = px.line(df_monthly, x='Month', y='TAVG', color='COUNTRY',
              title='Average Monthly Temperature Fluctuations in Tunisia and Cameroon',
              labels={'Month': 'Month', 'TAVG': 'Average Temperature (°F)'})

fig.update_layout(legend_title_text='Country')
fig.show()

In [None]:
# Define the correct order for months
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Convert 'Month' to an ordered categorical type; work on a copy so the
# assignment does not trigger a SettingWithCopyWarning
df_filtered = df_filtered.copy()
df_filtered['Month'] = pd.Categorical(df_filtered['Month'], categories=month_order, ordered=True)

# Group by 'Month' and 'COUNTRY', then calculate the average temperature
df_monthly = df_filtered.groupby(['Month', 'COUNTRY'], observed=True)['TAVG'].mean().reset_index()

# Create the line chart
fig = px.line(df_monthly, x='Month', y='TAVG', color='COUNTRY',
              title='Average Monthly Temperature Fluctuations in Tunisia and Cameroon',
              labels={'Month': 'Month', 'TAVG': 'Average Temperature (°F)'})

fig.update_layout(legend_title_text='Country')
fig.show()

### Zoom in on the data between 1980 and 2005, and customize the axes labels for better clarity.

In [None]:
# Filter the data between 1980 and 2005
df_filtered_2 = df_filtered[(df_filtered['Year'] >= 1980) & (df_filtered['Year'] <= 2005)]

# Group by 'Year' and 'COUNTRY', then calculate the average temperature
df_yearly = df_filtered_2.groupby(['Year', 'COUNTRY'])['TAVG'].mean().reset_index()

# Create the line chart
fig = px.line(df_yearly, x='Year', y='TAVG', color='COUNTRY',
              title='Average Yearly Temperature Fluctuations in Tunisia and Cameroon (1980-2005)',
              labels={'Year': 'Year', 'TAVG': 'Average Temperature (°F)'})

# Customize x-axis and y-axis labels, with a tick every 5 years
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Average Temperature (°F)',
    xaxis=dict(
        tickmode='linear',
        tick0=1980,
        dtick=5
    )
)

# Show the figure
fig.show()

### Temperature Distribution in Senegal (1980-2000 vs 2000-2023)

In [None]:
# Filter the data for Senegal (copy to avoid a SettingWithCopyWarning on assignment)
senegal_df = df[df['COUNTRY'] == 'Senegal'].copy()

# Extract the year from 'DATE'
senegal_df['Year'] = senegal_df['DATE'].dt.year

# Split the data into two periods; the year 2000 is counted only in the first
senegal_1980_2000 = senegal_df[(senegal_df['Year'] >= 1980) & (senegal_df['Year'] <= 2000)]
senegal_2000_2023 = senegal_df[(senegal_df['Year'] > 2000) & (senegal_df['Year'] <= 2023)]


# Create histograms
hist_1980_2000 = go.Histogram(
    x=senegal_1980_2000['TAVG'],
    opacity=0.6,
    name='1980-2000',
    marker=dict(color='blue')
)

hist_2000_2023 = go.Histogram(
    x=senegal_2000_2023['TAVG'],
    opacity=0.6,
    name='2000-2023',
    marker=dict(color='red')
)

# Combine histograms in one figure
fig = go.Figure(data=[hist_1980_2000, hist_2000_2023])

# Update layout for better visibility
fig.update_layout(
    barmode='overlay',
    title='Temperature Distribution in Senegal (1980-2000 vs 2000-2023)',
    xaxis_title='Average Temperature (°F)',
    yaxis_title='Frequency',
    legend_title_text='Period',
    legend=dict(
        x=0.05, y=0.95,
        bgcolor='rgba(255, 255, 255, 0.5)'
    )
)

# Show the figure
fig.show()
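To back up the visual comparison with numbers, the two periods can be summarized side by side. This is a minimal sketch on made-up values; `toy`, `Period`, and `period_stats` are hypothetical names, and the second period is labeled 2001-2023 here because the year 2000 falls in the first bin.

```python
import pandas as pd

# Hypothetical toy data standing in for senegal_df
toy = pd.DataFrame({
    'Year': [1985, 1990, 2000, 2001, 2010, 2023],
    'TAVG': [80.0, 82.0, 81.0, 83.0, 85.0, 84.0],
})

# Label each row with its period; 2000 is assigned to the earlier period,
# matching the split used for the histograms
toy['Period'] = pd.cut(toy['Year'], bins=[1979, 2000, 2023],
                       labels=['1980-2000', '2001-2023'])

# Compare the centre and spread of the two distributions
period_stats = toy.groupby('Period', observed=True)['TAVG'].agg(['mean', 'median', 'std'])
print(period_stats)
```

A shift in the mean or median between the two rows would indicate that the whole distribution moved, while a change in `std` would indicate that it widened or narrowed.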

From the earlier line chart of yearly averages for Tunisia and Cameroon, we can see that Tunisia has a lower average temperature than Cameroon.
- The peak for Cameroon was in 1991.
- The peak for Tunisia was in 1999.
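Peak years like these can be read off programmatically rather than by eye. The sketch below uses made-up yearly averages chosen only for illustration (`toy_yearly` and `peak_rows` are hypothetical names); applying the same lookup to the real `df_yearly` would confirm the years quoted above.

```python
import pandas as pd

# Hypothetical yearly averages standing in for df_yearly
toy_yearly = pd.DataFrame({
    'Year': [1990, 1991, 1992, 1998, 1999, 2000],
    'COUNTRY': ['Cameroon'] * 3 + ['Tunisia'] * 3,
    'TAVG': [80.1, 81.4, 80.3, 67.0, 68.9, 67.5],
})

# Row label of the maximum TAVG within each country, then look up those rows
peak_rows = toy_yearly.loc[toy_yearly.groupby('COUNTRY')['TAVG'].idxmax()]
print(peak_rows[['COUNTRY', 'Year', 'TAVG']])
```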

### Country-Wise Temperature Visualization

In [1]:
# Group by country and calculate the average temperature
country_avg_temp = df.groupby('COUNTRY')['TAVG'].mean().reset_index()

# Create a bar chart
fig = px.bar(country_avg_temp, x='COUNTRY', y='TAVG', 
             title='Average Temperature per Country',
             labels={'TAVG': 'Average Temperature (°F)', 'COUNTRY': 'Country'})

# Customize the layout for better clarity
fig.update_layout(xaxis_title='Country', yaxis_title='Average Temperature (°F)', 
                  xaxis_tickangle=-45, 
                  title_font_size=20)

# Show the figure (requires `df` from the loading and cleaning cells above)
fig.show()