# Introduction

In this notebook I lay out my analysis of the [Daily Temperature of Major Cities](https://www.kaggle.com/sudalairajkumar/daily-temperature-of-major-cities) dataset, which contains the average daily temperatures (in degrees Fahrenheit) for 157 U.S. and 167 international cities.

First, we go through a simple data cleaning process; then, we explore the dataset over a period of 20 years (from 2000 to 2019) to gather interesting insights.

Please feel free to leave your thoughts and feedback in the comment section below. As a beginner, I would love constructive criticism to help me and other newbies grow! :)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
sns.set_style('darkgrid')

In [None]:
df = pd.read_csv('../input/daily-temperature-of-major-cities/city_temperature.csv')
df.head()

# Data Cleaning Process

We start off by renaming all columns to lowercase letters.

In [None]:
df = df.rename(columns={
    'Region': 'region',
    'Country': 'country',
    'State': 'state',
    'City': 'city',
    'Month': 'month',
    'Day': 'day',
    'Year': 'year',
    'AvgTemperature': 'avg_temp'
})
df.head()

We inspect each column of the dataset in turn, keeping an eye out for missing values and data artifacts.

In [None]:
df.region.unique()

In [None]:
df.country.nunique()

In [None]:
df.country.isna().any()

In [None]:
df.state.unique()

In [None]:
df.city.nunique()

In [None]:
df.city.isna().any()

In [None]:
df.avg_temp.isna().any()

In [None]:
df.day.unique()

In [None]:
len(df[df.day == 0])

A `day` value of `0` is not a valid calendar day, so we drop the rows containing it.

In [None]:
df = df[df.day != 0]

In [None]:
df.year.unique()

In [None]:
len(df[df.year.isin([200, 201])])

The `year` values `200` and `201` are likewise invalid, so we drop those rows as well.

In [None]:
df = df[~df.year.isin([200, 201])]
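
The two filters above can also be expressed as a single range check. The following is a minimal sketch on invented toy data; the helper name `valid_date_mask` and the year bounds (1995 to 2020) are my own assumptions, not something taken from the dataset documentation.

```python
import pandas as pd

def valid_date_mask(frame, min_year, max_year):
    """Return a boolean mask of rows whose year/month/day are in plausible ranges."""
    return (
        frame["year"].between(min_year, max_year)
        & frame["month"].between(1, 12)
        & frame["day"].between(1, 31)
    )

# Hypothetical rows mimicking the artifacts found above (day 0, years 200 and 201)
toy = pd.DataFrame({
    "year": [2000, 200, 2001, 201],
    "month": [1, 1, 2, 12],
    "day": [0, 15, 28, 31],
})

# The year bounds are assumptions; adjust them to the actual coverage of the data
clean = toy[valid_date_mask(toy, min_year=1995, max_year=2020)]
print(clean)
```

This only checks ranges; it would not catch impossible combinations such as February 30.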

In [None]:
df.avg_temp.value_counts().nlargest(10)

The `avg_temp` column contains a suspiciously frequent value, namely `-99.0`.

As a matter of fact, according to [this link](http://academic.udayton.edu/kissock/http/Weather/source.htm), it is used to mark a missing measurement!

Thus, we compute its relative frequency for each city to determine the **relevance of the error value**.

In [None]:
grouped_freqs = df.groupby('city').avg_temp.value_counts(normalize=True)
grouped_freqs.head()

In [None]:
sorted_freqs_df = grouped_freqs.unstack(level='avg_temp').sort_values(by=-99., ascending=False)
sorted_freqs_df[-99.].head(15)
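
The same per-city share of missing readings can be obtained without unstacking every distinct temperature value. This is a minimal sketch on invented toy data: comparing against the placeholder gives a boolean column, and the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

MISSING = -99.0  # placeholder used for missing measurements

# Hypothetical toy data: city A misses half of its readings, city B none
toy = pd.DataFrame({
    "city": ["A", "A", "A", "A", "B", "B"],
    "avg_temp": [-99.0, 50.0, -99.0, 61.5, 70.2, 71.0],
})

missing_share = (
    toy["avg_temp"].eq(MISSING)   # True where the reading is missing
    .groupby(toy["city"])         # one group per city
    .mean()                       # fraction of True values in each group
    .sort_values(ascending=False)
)
print(missing_share)
```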

We notice that the first 5 cities are the ones most affected by the missing measurements.

In light of what we have just discovered, let's keep only the rows for cities whose relative frequency of missing measurements is $\leq 10 \%$.

In [None]:
cities_to_remove = sorted_freqs_df[sorted_freqs_df[-99.] > .1][-99.].index.to_list()
print(f'We are going to remove the following cities:\n\n{", ".join(cities_to_remove)}.')
df = df[~df.city.isin(cities_to_remove)]

We replace the remaining `-99.0` placeholders with `NaN`, which pandas treats as a missing value.

In [None]:
df.avg_temp = df.avg_temp.replace(-99., np.nan)

Now that the `NaN` values are in place, it would be interesting to determine the **top-10 cities having the highest number of months in which all 31 daily temperature measurements are missing**.

In [None]:
# First, count how often each temperature value (NaN included)
# occurs for every year/month/city combination.
groupby_filter = [df.year, df.month, 'city']
temp_counts_df = df.groupby(groupby_filter).avg_temp.value_counts(dropna=False).unstack()
# Then, select the rows whose NaN column equals 31.
# These are the year/month/city combinations in which
# all 31 daily measurements are missing (shorter months are not captured).
missing_series = temp_counts_df.loc[temp_counts_df[np.nan] == 31.][np.nan]
# Finally, replace every 31 with 1 so that the sum per city
# gives the number of such months.
outages_series = missing_series.replace(31., 1).unstack('city').sum().nlargest(10)
# Free the large intermediate table...
del temp_counts_df
# ...and plot the cities against the number of fully missing 31-day months
# in the overall dataset.
fig, ax = plt.subplots(figsize=(12,10))
outages_series.sort_values().plot(kind='barh', rot=30, ax=ax)
ax.set_title('Cities having the highest no. of missing temperature measurements for 31 days straight.')
ax.set_xlabel('Number of occurrences')
ax.yaxis.label.set_visible(False)
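
The cell above only catches months with exactly 31 missing readings. As a hedged alternative, the sketch below flags any year/month/city group in which every recorded reading is `NaN`, whatever the month length. It runs on invented toy data, and it does not verify that each month actually has one row per calendar day.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two readings per month to keep the example short
toy = pd.DataFrame({
    "city": ["A"] * 4 + ["B"] * 4,
    "year": [2000] * 8,
    "month": [4, 4, 5, 5, 4, 4, 5, 5],
    "avg_temp": [np.nan, np.nan, 60.0, np.nan, 55.0, 56.0, np.nan, np.nan],
})

# True for a city/year/month group only if all of its readings are missing
fully_missing = (
    toy["avg_temp"].isna()
    .groupby([toy["city"], toy["year"], toy["month"]])
    .all()
)

# Number of fully missing months per city
outages_per_city = fully_missing.groupby(level="city").sum()
print(outages_per_city)
```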

Since several cities are missing entire months' worth of temperature data, I believe that there are two possible ways to proceed:

* Drop all the rows having missing temperature readings, or
* Forward-fill, i.e. propagate the last non-null temperature value until the next non-null value.

I decided on the latter.

In [None]:
# Series.ffill() is equivalent to fillna(method='ffill'), which recent pandas versions deprecate
df.avg_temp = df.avg_temp.ffill()
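
One caveat worth keeping in mind: a plain forward fill over the whole column can carry the last reading of one city into the leading gaps of the next city in the table. The sketch below, on invented toy data, contrasts it with a per-city forward fill; it is an illustration of the difference, not a change to the analysis in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: city B starts with a missing reading
toy = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "avg_temp": [50.0, np.nan, np.nan, 70.0],
})

# Plain forward fill: B's first reading is filled with A's last value
plain = toy["avg_temp"].ffill()

# Per-city forward fill: values never cross a city boundary,
# so B's first reading stays NaN
per_city = toy.groupby("city")["avg_temp"].ffill()

print(pd.DataFrame({"city": toy["city"], "plain": plain, "per_city": per_city}))
```

With the per-city variant, any gaps at the very start of a city's record remain `NaN` and would need separate handling.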

We convert Fahrenheit to Celsius, just for fun.

In [None]:
to_celsius_equation = (df.avg_temp - 32) * 5 / 9
df = df.assign(avg_temp_c=np.round(to_celsius_equation, 2)).drop(columns='avg_temp')
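
For reference, the conversion is $C = (F - 32) \times \frac{5}{9}$. A minimal sketch of it as a reusable function (the name `fahrenheit_to_celsius` is my own) with a few well-known reference points as a sanity check:

```python
import pandas as pd

def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius (works on scalars and Series)."""
    return (temp_f - 32) * 5 / 9

# Freezing point, boiling point, and the point where both scales coincide
checks = fahrenheit_to_celsius(pd.Series([32.0, 212.0, -40.0]))
print(checks.tolist())  # [0.0, 100.0, -40.0]
```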

We finalize the cleaning process by merging the `year`, `month` and `day` columns into a new `date` column.

In [None]:
datetime_series = pd.to_datetime(df[['year', 'month', 'day']])
df['date'] = datetime_series
df = df.drop(columns=['year', 'month', 'day'])
df = df.set_index('date')
df.head()

# Data Analysis

## Mean yearly regional temperature trend from 2000 to 2019
We begin the analysis by visualizing the mean yearly temperature trend from 2000 to 2019, for each region.

In [None]:
last_twenty_years_df = df[(df.index.year >= 2000) & (df.index.year <= 2019)]
region_year_filter = ['region', pd.Grouper(freq='Y')]
last_twenty_years_region_temp = last_twenty_years_df.groupby(region_year_filter).avg_temp_c \
                                                    .mean()                                 \
                                                    .unstack('region')
last_twenty_years_region_temp.head()
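
If the frequency alias passed to `pd.Grouper` is a concern (the accepted spelling of the yearly alias has changed across pandas versions), grouping on the calendar year of the index gives the same yearly means. This is a minimal sketch on invented toy data; note that the result is indexed by integer years rather than timestamps, so the date-based tick locator used for plotting below would not apply to it as is.

```python
import pandas as pd

# Hypothetical toy data with a DatetimeIndex named 'date'
toy = pd.DataFrame(
    {
        "region": ["Europe", "Europe", "Africa", "Africa"],
        "avg_temp_c": [10.0, 12.0, 25.0, 27.0],
    },
    index=pd.to_datetime(["2000-01-01", "2000-07-01", "2000-01-01", "2001-01-01"]),
)
toy.index.name = "date"

# Group by region and by the calendar year extracted from the index
year = toy.index.year.rename("year")
yearly = toy.groupby(["region", year])["avg_temp_c"].mean().unstack("region")
print(yearly)
```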

In [None]:
# I had to override the tick_values method of Matplotlib's YearLocator class
# because the date interval I'm analyzing (i.e. 2000 to 2019)
# cannot be drawn correctly using the default YearLocator.
#
# The implementation is very similar to the default YearLocator.
# In fact, the most relevant change is on the following line:
# ymax = self.base.ge(vmax.year) * self.base.step + self.base.step
# There, I've added the extra term self.base.step.
# Note: self.base and self.replaced are Matplotlib internals and may differ between versions.

class OddIntervalYearLocator(mdates.YearLocator):
    def tick_values(self, vmin, vmax):
        ymin = self.base.le(vmin.year) * self.base.step
        ymax = self.base.ge(vmax.year) * self.base.step + self.base.step
        ticks = [vmin.replace(year=ymin, **self.replaced)]
        while True:
            dt = ticks[-1]
            if dt.year >= ymax:
                return mdates.date2num(ticks)
            year = dt.year + self.base.step
            ticks.append(dt.replace(year=year, **self.replaced))

In [None]:
def plot_yearly_temperature(df, fig_title, fig_size, marker_type=None):
    fig, ax = plt.subplots(figsize=fig_size)
    if marker_type is None:
        sns.lineplot(data=df, markers=True, markersize=10, dashes=False, ax=ax)
        # Force legend position when there are multiple lines on the plot
        ax.legend(loc='center left')
    else:
        # Use the marker requested by the caller instead of a hard-coded one
        sns.lineplot(data=df, marker=marker_type, markersize=10, ax=ax)
    ax.set_title(fig_title)
    ax.set_ylabel('Temperature (Celsius)')
    ax.xaxis.label.set_visible(False)
    # Draw a major tick every 2 years
    ax.xaxis.set_major_locator(OddIntervalYearLocator(base=2, month=12, day=31))
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

In [None]:
plot_yearly_temperature(last_twenty_years_region_temp[['Europe', 'Middle East', 'Africa']],
                        fig_title='Mean yearly temperature (2000 - 2019) - EMEA',
                        fig_size=(12, 10))

In [None]:
other_df = last_twenty_years_region_temp[['North America', 'South/Central America & Carribean',
                                          'Asia', 'Australia/South Pacific']]
plot_yearly_temperature(other_df,
                        fig_title='Mean yearly temperature (2000 - 2019) - Other regions',
                        fig_size=(12, 10))

We notice that Africa, Middle East and South/Central America & Carribean show a bumpy but steady increase in the mean yearly temperature. Europe and Asia, on the other hand, remain roughly constant.

## Mean yearly Earth's temperature trend from 2000 to 2019
The next visualization focuses on the trend of the Earth's average temperature over the same time period, approximated here by the mean over all the cities in the dataset.

In [None]:
plot_yearly_temperature(last_twenty_years_df.groupby(pd.Grouper(freq='Y')).avg_temp_c.mean(),
                        fig_title='Mean yearly temperature (2000 - 2019) - Earth',
                        fig_size=(10, 5),
                        marker_type='o')

As expected, the Earth's average temperature is increasing at an alarming pace: over the last two decades, it appears to have risen by $\sim 0.5$ degrees Celsius.
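
One way to put a number on such an increase, rather than reading it off the plot, is to fit a straight line to the yearly means. The sketch below uses a synthetic series (an assumed warming of 0.025 °C per year, chosen only for illustration), not the values computed in this notebook.

```python
import numpy as np

years = np.arange(2000, 2020)

# Synthetic yearly means: 15 °C in 2000 plus 0.025 °C per year (illustrative only)
yearly_mean_c = 15.0 + 0.025 * (years - 2000)

# Fit a degree-1 polynomial; years are shifted to start at 0 for numerical stability
slope, intercept = np.polyfit(years - years[0], yearly_mean_c, deg=1)

# Change implied by the fitted line between the first and the last year
total_change = slope * (years[-1] - years[0])
print(f"{slope:.3f} °C/year, {total_change:.3f} °C over the period")
```

On real data the yearly means are noisy, so the fitted slope is a summary of the trend and not an exact difference between the first and last year.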

## Top-10 cities having the highest/lowest mean temperature from 2000 to 2019

We now shift our attention to the top-10 cities having the highest/lowest mean temperature from 2000 to 2019.

In [None]:
last_twenty_years_df.groupby('city').avg_temp_c.mean().nlargest(10)

In [None]:
last_twenty_years_df.groupby('city').avg_temp_c.mean().nsmallest(10)
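
A possible pitfall when grouping by `city` alone: if two different places share a name, their readings are averaged together. The sketch below shows the effect on invented toy rows (whether the real dataset contains such duplicates should be checked first) and groups by `country`, `state` and `city` instead; `dropna=False` keeps the rows whose `state` is empty.

```python
import pandas as pd

# Hypothetical toy rows: two places with the same city name
toy = pd.DataFrame({
    "country": ["US", "US", "US", "US"],
    "state": ["Maine", "Maine", "Oregon", "Oregon"],
    "city": ["Portland"] * 4,
    "avg_temp_c": [7.0, 9.0, 12.0, 14.0],
})

# Grouping by city name alone merges both places into one mean
by_city = toy.groupby("city")["avg_temp_c"].mean()

# Grouping by country/state/city keeps them apart
by_place = toy.groupby(["country", "state", "city"], dropna=False)["avg_temp_c"].mean()

print(by_city)
print(by_place)
```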

## Hottest days in India
Finally, we highlight the hottest days in the major cities of India. As you may know, India has several of the world's hottest places. 

In [None]:
last_twenty_years_df.query('country == "India"').groupby('city').avg_temp_c \
                    .agg(hottest_day='idxmax', avg_temp_c='max')