### Introduction

In this study, I present a dataset with the daily average temperature of many cities over almost 20 years. The goal is to show the reader some interesting Python code, some statistical tools, and some data exploration concepts. My motivation was to show how we can use statistics to explore data, and how statistical tools can be applied to a real-life problem.

First I start with some Initial Exploration (section 1), where I use some simple Python commands to get to know the dataset. Then I present a Time series plot and outliers (section 1.1), where the idea is to answer the following question: How has the temperature changed over the years 1995 to 2014? Next (section 1.2) I use a Boxplot to answer: What is the variation of the temperature in a year? Histograms and the concept of symmetry are presented in section 1.3.

Some probability (section 2) is discussed to answer these questions: What is the probability of having a temperature greater than 70F (21C) in a city? and How to know which city is colder than another one?

I use a simple question (How to know if a temperature is unusual?) to show an example of how to use statistics to make decisions (section 3). Then I return to Histograms (section 4) to show a normality test. I take advantage of the histograms to present a use of the Central Limit Theorem (section 4.1).

Finally, in the Hypothesis test (section 5) I explain why I use the normality test to answer: How to compare the temperature of two cities?

In the next study, I'll use the concepts seen here to build a suggestion system. The final idea is an algorithm that, for a given city, searches the dataset and shows the cities that have the same temperature profile.


In [None]:
#Here are all the libraries I'll use in this study

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from matplotlib import colors 
from matplotlib.ticker import PercentFormatter 
import numpy as np
import scipy
from scipy.stats import norm
from scipy import stats
import statsmodels.api as sm
import pylab
from scipy.stats import chi2_contingency 
from pandas import DataFrame
import seaborn as sns
from geopy.geocoders import Nominatim
from scipy.stats import shapiro
from scipy.stats import kstest 



In [None]:
df = pd.read_csv("../input/daily-temperature-of-major-cities/city_temperature.csv")


### 1. Initial Exploration



Let’s take a look at the first 5 rows of the dataframe we just created.

In [None]:
df.head()

Beyond the Region, Country, State and City, we can see the month (1 = January, 2 = February and so on), the day of the month, the year, and the average temperature (in Fahrenheit) of that day.

The code below checks the data type of each column. We can see numbers in the table, but we don't know whether pandas stores them as numbers or as strings. If a column is stored as strings, we cannot perform any calculation on it before converting it. This is a useful first step in our analysis.

In [None]:
df.dtypes

From the information above we can see, for example, that AvgTemperature is a float. It means that the values in that column are stored as numbers, not strings.
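If a numeric column had been loaded as strings, one possible way to convert it is `pd.to_numeric`. The sketch below uses a small made-up dataframe (`raw` and its values are hypothetical, not part of the dataset) just to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy column: temperatures stored as text, with one bad entry
raw = pd.DataFrame({"AvgTemperature": ["64.2", "49.4", "n/a"]})

# errors="coerce" turns anything that cannot be parsed into NaN
raw["AvgTemperature"] = pd.to_numeric(raw["AvgTemperature"], errors="coerce")

print(raw.dtypes)

# NaN values are skipped by default when computing the mean
mean_temp = raw["AvgTemperature"].mean()
print(mean_temp)
```

After the conversion the column behaves like any other float column, so calculations such as the mean work as expected.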

Now, I would like to count how many countries and cities we have in this dataset.

In [None]:
country_count = len(df['Country'].unique())
city_count = len(df['City'].unique())

print('The dataset has {0} countries and {1} cities'.format(country_count, city_count))

Let's see which Canadian cities are represented in the dataset.

In [None]:
"""
If you want to check another country but are not so sure which 
countries we have represented in the dataset, first you need
to type and run the following command:

df['Country'].unique()

You can check the name of the countries and then change Canada
to a country you desire to check its cities
"""


df[(df.Country=='Canada')]['City'].unique()

### 1.1 Time series plot and outliers


How has the temperature changed over the years 1995 to 2014?

Now we're going to see a time series plot for the temperature of a particular city.

In [None]:
# First, I'm going to create a dataframe for the city I want to look at.
# .copy() makes it an independent dataframe, so adding columns below
# does not trigger pandas' SettingWithCopyWarning.

df_halifax = df[(df.City=='Halifax')].copy()


# To create a time series plot we need a very specific format:

# 1) Join the columns Month, Day and Year into one string column named "Date"

df_halifax["Date"] = df_halifax["Month"].astype(str) +"/"+ df_halifax["Day"].astype(str) +"/"+ df_halifax["Year"].astype(str)

# 2) Convert the string column 'Date' to the datetime data type
df_halifax['Date'] = pd.to_datetime(df_halifax['Date'])

# Plotting the time series as points ('o'), one per day.
# plt.plot with the 'o' marker draws the same points as plt.plot_date,
# which newer Matplotlib releases deprecate.
figure(num=None, figsize=(15, 10), dpi=80, facecolor='w', edgecolor='w')
plt.plot(df_halifax['Date'], df_halifax['AvgTemperature'], 'o')

plt.title('Halifax - Avg Temperature', fontsize=30)
plt.xlabel('Years', fontsize=30)
plt.ylabel('Daily avg temperature (F)', fontsize=30)


It's a beautiful plot, and it shows what we expected: the temperature goes up and down following a wave pattern. We can also see some outliers. Let's take a closer look at them.

In [None]:
df_halifax['AvgTemperature'].describe()

The information above shows that the minimum value of AvgTemperature is -99F. We can see those points at the bottom of the plot above. Such a temperature is not a plausible reading for Halifax; the value looks like a placeholder for missing measurements.

In [None]:
# I'm going to remove those outliers from the original dataframe

indexNames = df[ df['AvgTemperature'] == -99 ].index
df.drop(indexNames , inplace=True)


# I'm going to select again all the temperature values for Halifax
df_halifax = df[(df.City=='Halifax')]

I'll describe the Halifax AvgTemperature again.

In [None]:
df_halifax['AvgTemperature'].describe()

Now we can see the temperatures go from -7.8F to 80.5F.
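Dropping the -99 rows works well here. As an alternative, and assuming that -99 is a marker for missing measurements (an assumption on my part, not something stated by the dataset), the values could be replaced with NaN so the rows stay in place. The sketch below uses a made-up dataframe named `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({
    "City": ["Halifax", "Halifax", "Halifax", "Halifax"],
    "AvgTemperature": [30.5, -99.0, 28.1, 33.0],
})

# Replace the placeholder with NaN; no rows are removed
toy["AvgTemperature"] = toy["AvgTemperature"].replace(-99.0, np.nan)

# Count how many days have no measurement
missing_days = toy["AvgTemperature"].isna().sum()
print(missing_days)

# min(), mean(), count() and describe() skip NaN by default
lowest = toy["AvgTemperature"].min()
print(lowest)
```

With this approach every calendar day keeps its row, which can be convenient for time series plots, while the summary statistics still ignore the missing values.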

### 1.2 Boxplot, the variation of temperature and more outliers

What is the variation of the temperature in a year?

I'm going to plot the temperature month by month for just a year.

In [None]:
# Filtering only the data related to the year 1995 (Halifax)
# There is nothing special about picking this year ;-)

df_halifax_1995 = df_halifax[(df_halifax.Year==1995)]

# Plotting the data

figure(num=None, figsize=(8,5), dpi=80, facecolor='w', edgecolor='k')
# color='black' is the unambiguous way to request a single color for all points
plt.scatter(df_halifax_1995['Month'], df_halifax_1995['AvgTemperature'], color='black', alpha=0.5)


plt.title('Halifax - 1995 - Avg Temperature', fontsize=30)
plt.xlabel('Month', fontsize=30)
plt.ylabel('Daily avg temperature (F)', fontsize=30)

plt.show()

We can see a kind of wave trend (as expected), and we can also see the variation of the temperature within each month. It would be interesting to see the plot above in a better way.

In [None]:
data = []
for i in range(1,13):
    data_x = df_halifax_1995['AvgTemperature'][(df_halifax_1995.Month ==i)]
    data.append(data_x)

fig = plt.figure(figsize =(10, 7)) 
  

ax = fig.add_axes([0, 0, 1, 1]) 
  
# Creating plot 
bp = ax.boxplot(data) 

plt.title('Halifax - 1995 - Avg Temperature', fontsize=30)
plt.xlabel('Month', fontsize=30)
plt.ylabel('Daily avg temperature (F)', fontsize=30)
  

plt.show() 

That is a boxplot showing how the temperature changes throughout the year. I would like to explain a little bit about **1) boxplots** and **2) some interpretation of the results we've seen in the plot above.**

1) The rectangles shown in the plot represent the interquartile range. Let's look at the boxplot for January. The upper side of the rectangle is around 33F. This upper side is the 3rd quartile (or the 75th percentile). It means that 75% of the January temperatures are below 33F. Or, if you wish, 25% of the time the temperature is above 33F. The lower side of the rectangle shows the 1st quartile (or the 25th percentile). It means that 25% of the January temperatures are below approximately 18F.

So, with that information, we can see that 50% of the data is in the range of the rectangle. 50% of the time the temperature is between 18F and 33F.

If we take a look at the temperature of July, for example, we see that the range of the temperature is smaller than in January. It means that the temperature in July is more concentrated, grouped (or less spread out). Going back to January, the colored line inside the rectangle (around 28F) is the median (or the 2nd quartile, also known as the 50th percentile). The idea is the same as for the 1st and 3rd quartiles. It means that 28F is in the middle: 50% of the temperatures in January are above 28F and 50% are below it.

Let's take a look at the minimum and maximum shown on the boxplot (the ends of the whiskers). Taking January as an example, we have -3F and 53F as the minimum and maximum, respectively. These are not necessarily the true extremes of the month: by default, each whisker extends to the most extreme value that lies within 1.5 times the interquartile range (the 3rd quartile minus the 1st quartile, i.e. the height of the rectangle) of the rectangle's edge. These limits are interesting because they show us the values that fall outside that range. Those values are known as outliers. If you take a look at the boxplot for March, April, July, and October, you will see some outliers. Those values are considered unusual. This doesn't mean that we should remove them; the decision to keep or delete them is up to the person who is analyzing the data.


2) The first thing I would like to highlight is the range between the maximum and minimum temperatures. You can see that the range for January is much bigger than the range for July. It means that the variation of temperature in January is bigger than in July, where the difference between the highest and the lowest temperature of the month is not as big as in January.

The second thing to highlight is the outliers. Temperatures of 45F and 50F are unusual for March but not for January.
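To make the boxplot elements concrete, here is a minimal sketch that computes the quartiles, the whisker ends and the outliers by hand, using the 1.5 x IQR rule that Matplotlib applies by default. The function name `box_stats` and the sample values are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers using the 1.5 x IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1

    # Fences: anything beyond them is drawn as an outlier
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr

    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]

    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme points still inside the fences
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": outliers,
    }

# Made-up temperatures with one clearly unusual value
sample = [18, 20, 22, 25, 28, 30, 31, 33, 35, 70]
result = box_stats(sample)
print(result)
```

In this toy sample the value 70 falls above the upper fence, so it would be drawn as a separate point, and the upper whisker would stop at 35.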

### 1.3 Histogram and symmetry

Lastly, I'd like to mention the symmetry/asymmetry of the data. If we look at February, for example, we see that the median is almost in the middle of the rectangle, while in April the median is closer to the lower side of the rectangle. This can reveal the symmetry of the data. To understand this better, let's take a look at the histograms for April and February.

In [None]:
#Histogram for April

df_halifax_april_1995 = df_halifax_1995[(df_halifax_1995.Month==4)]

n_bins = 7
   
# Creating histogram 
fig, axs = plt.subplots(1, 1, 
                        figsize =(8,4),  
                        tight_layout = True) 
  
axs.hist(df_halifax_april_1995['AvgTemperature'], bins = n_bins, ec='black') 
plt.xticks(fontsize = 20) 
plt.yticks(fontsize = 20)

plt.title('Halifax - April - 1995 ', fontsize=30)


#Histogram for February

df_halifax_february_1995 = df_halifax_1995[(df_halifax_1995.Month==2)]

   
# Creating histogram 
fig, axs = plt.subplots(1, 1, 
                        figsize =(8, 4),  
                        tight_layout = True) 
  
axs.hist(df_halifax_february_1995['AvgTemperature'], bins = n_bins, ec='black') 
plt.xticks(fontsize = 20) 
plt.yticks(fontsize = 20)

plt.title('Halifax - February - 1995 ', fontsize=30)


  
# Show plot 
plt.show() 

2) (cont) You can see that the distribution for February looks more symmetrical than the distribution for April, which seems to be more concentrated on the right of the chart. That characteristic shows up clearly in the position of the median relative to the rectangle **(in the boxplot chart)**.
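Symmetry can also be summarized with a single number, the sample skewness. The sketch below is only an illustration on synthetic data (the arrays `symmetric` and `right_skewed` are generated, not taken from the dataset): a value near 0 suggests a symmetric distribution, a positive value a longer tail on the right, and a negative value a longer tail on the left.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic samples: one symmetric, one with a long right tail
symmetric = rng.normal(loc=30, scale=5, size=5000)
right_skewed = rng.exponential(scale=5, size=5000)

skew_symmetric = skew(symmetric)
skew_right = skew(right_skewed)

print(skew_symmetric)   # close to 0
print(skew_right)       # clearly positive
```

The same call could be applied to the February and April temperatures, for example `skew(df_halifax_february_1995['AvgTemperature'])`, to put a number on what the histograms show.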

### 2. Probability

Let's work with probabilities.

What was the probability of having a temperature higher than 70F (21C) in Halifax in 1995?

In [None]:
# creating a function to perform that calculation
def temperature_probability(dataframe,temperature):
    temp = dataframe[(dataframe.AvgTemperature > temperature)]['AvgTemperature'].count()
    total = dataframe['AvgTemperature'].count()
    print('The probability to have a temperature bigger than '+str(temperature) + 'F is: '+ str( round((temp/total)*100,1 ))+'%')


In [None]:
temperature_probability(df_halifax_1995,70)

The probability was calculated as the number of days that met the condition divided by the total number of values we have for that year. For example, we had 14 days in 1995 where the temperature was above 70F, and the total number of days in that year was 365. Thus, 14 / 365 = 3.8%.

I'm going to take advantage of the computing power we have and test that probability. The idea is to randomly pick a temperature from that particular year and check if it is higher than 70F. I'll do it 100,000 times and calculate the percentage of times the temperature was above 70F.

In [None]:
import random
count_days = 0
max_times = 100000

# Build the list once, outside the loop, instead of on every iteration
temps_1995 = list(df_halifax_1995.AvgTemperature)

for i in range(0,max_times):
    rand_temp = random.choice(temps_1995)
    if rand_temp > 70:
        count_days +=1

percent = round((count_days/max_times)*100,2)

print(str(percent)+'%')

So, if we pick a random day in Halifax (1995), we have a 3.8% chance of getting a temperature above 70F.
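The same experiment can be written without an explicit loop. This is a minimal sketch using NumPy on a made-up array of temperatures (`temps` is hypothetical, not the Halifax data): the exact probability is the mean of a boolean mask, and the simulation draws all 100,000 values at once.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up temperatures standing in for one year of data
temps = np.array([72.0, 65.0, 50.0, 71.5, 30.0, 44.0, 68.0, 75.0, 20.0, 55.0])

# Exact probability: share of values above 70 (3 out of 10 here)
exact = np.mean(temps > 70)

# Simulation: draw 100,000 values with replacement and repeat the check
draws = rng.choice(temps, size=100_000, replace=True)
simulated = np.mean(draws > 70)

print(exact, simulated)
```

With this many draws the simulated share should land very close to the exact one, which is the behavior we saw with the loop version.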

In [None]:
temperature_probability(df_halifax_1995,14)

Let's see another city.

In [None]:
temp = df[(df.City=='Winnipeg')]
df_winnipeg_1995 = temp[(temp.Year==1995)]

In [None]:
temperature_probability(df_winnipeg_1995,70)

In [None]:
temperature_probability(df_winnipeg_1995,14)

How to know which city is colder than another one?

From the results above, Winnipeg had temperatures above 14F on 74% of the days, so 100% - 74% = 26% of the year was at or below 14F. For Halifax the figure is 100% - 96.2% = 3.8%. In 1995, Halifax therefore had only about 15% as many such cold days as Winnipeg (3.8 / 26 = 15%); put the other way, Winnipeg had roughly 7 times as many (26 / 3.8 = 6.8).
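The comparison can be sketched in a few lines. The shares below come from the percentages quoted in the text (96.2% and 74% of days above 14F); the helper `cold_day_share` is a hypothetical name I introduce here, and it is applied only to made-up values.

```python
import pandas as pd

def cold_day_share(temps, threshold):
    """Fraction of days with a temperature at or below the threshold."""
    return (pd.Series(temps) <= threshold).mean()

# Shares derived from the percentages quoted in the text
halifax_cold = 1 - 0.962    # 3.8% of days at or below 14F
winnipeg_cold = 1 - 0.74    # 26% of days at or below 14F

# How many times more cold days Winnipeg had compared with Halifax
ratio = winnipeg_cold / halifax_cold
print(round(ratio, 1))

# The helper applied to made-up values (not real data): 2 of 4 are <= 14
toy_share = cold_day_share([10, 20, 5, 30], 14)
print(toy_share)
```

On the real data, the helper could be called as `cold_day_share(df_winnipeg_1995['AvgTemperature'], 14)` to get the share directly instead of subtracting from 100%.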



### 3. Using statistics to make decisions

Now, let's take a look at the average temperature of January from 1995 to 2014 (I'm still working with the data from Halifax).

In [None]:
df_grouped_by_city_month_year = df.groupby(['City','Year', 'Month'])

# saving the temperatures from 1995 to 2014 (Halifax, January)
temps_of_january_Halifax = []
for year in range(1995,2015):
    mean_temp = df_grouped_by_city_month_year.get_group(('Halifax',year,1))['AvgTemperature'].mean()
    temps_of_january_Halifax.append(mean_temp)

figure(num=None, figsize=(8, 5), dpi=80, facecolor='w', edgecolor='k')
plt.grid()

plt.plot(range(1995,2015),temps_of_january_Halifax)  #plot the line graph
plt.plot(range(1995,2015),temps_of_january_Halifax,"or") #plot the points - "o" = circles and "r" = red 


plt.title('Halifax -  January avg temperature from 1995 to 2014', fontsize=20)
plt.xlabel('Year', fontsize=15)
plt.ylabel('Avg temperature (F)', fontsize=15)

It looks like the temperature in January changed a lot. Let's change the range of the y axis.

In [None]:
a= plt.figure()
axes= a.add_axes([0,0,1,1]) #this controls the size of the picture
plt.plot(range(1995,2015),temps_of_january_Halifax)  #plot the line graph
plt.plot(range(1995,2015),temps_of_january_Halifax,"or") #plot the points - "o" = circles and "r" = red 
axes.set_ylim([-10,80])  #Function to set a limit in the y axis. From -10 to 80.

plt.title('Halifax -  January avg temperature from 1995 to 2014', fontsize=20)
plt.xlabel('Year', fontsize=15)
plt.ylabel('Avg temperature (F)', fontsize=15)

This is the same graph shown above, and it shows the variation of temperature through the years. With the wider y axis it doesn't seem to have changed much, but this is just my observation. To decide whether that change in temperature is significant or not, we need to look at some statistical tests.

How to know if a temperature is unusual?

In [None]:
df_temps_of_january_Halifax = DataFrame(temps_of_january_Halifax,columns=['Temp'])
df_temps_of_january_Halifax['Temp'].plot(kind = 'box')

There is an outlier. It means that this temperature is statistically unusual for that group of data.

In [None]:
df_temps_of_january_Halifax['Temp'].describe()

In [None]:
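# Attach the corresponding year to each January mean
df_temps_of_january_Halifax['Year'] = list(range(1995,2015))

# set_index returns a new dataframe, so the result has to be assigned back
df_temps_of_january_Halifax = df_temps_of_january_Halifax.set_index('Year')

# Show the row (year) with the lowest January mean
df_temps_of_january_Halifax[df_temps_of_january_Halifax['Temp'] == min(df_temps_of_january_Halifax['Temp'])]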

Thus, in 2014 we had an unusual average temperature for January.

### 4. Histograms and normality test

Let's take a look at the distribution of the data in the whole year.

In [None]:
n_bins = 10
   
# Creating histogram 
fig, axs = plt.subplots(1, 1, 
                        figsize =(8, 4),  
                        tight_layout = True) 
  
axs.hist(df_halifax_1995['AvgTemperature'], bins = n_bins, ec='black') 
plt.xticks(fontsize = 20) 
plt.yticks(fontsize = 20)

plt.title('Halifax -  1995 ', fontsize=20)

  
# Show plot 
plt.show() 

We can see from the histogram above that this is not a normal distribution. The normal distribution plays many roles in statistics, and knowing that a distribution is normal allows us to perform some interesting statistical tests. One way to check whether a distribution is normal is the Q-Q plot. Let's draw the Q-Q plot to see if that distribution is normal. We are also going to use the Shapiro-Wilk test. The reason to use two checks is that the Q-Q plot is a visual test, and sometimes it's not easy to decide based on the plot alone. The Shapiro-Wilk test will work as a cross-check.

In [None]:
#this is a qqplot to test if our sample is normal

my_data = df_halifax_1995['AvgTemperature']
sm.qqplot(my_data, line='s')
plt.title('Halifax -  1995 ', fontsize=20)
pylab.show()

If a distribution is normal, all the points (in blue in the plot above) should lie close to the diagonal line (in red). Let's perform another test to check our expectations.

In [None]:
#Just because I'm going to use the Shapiro-Wilk test more than once,
#I'll create a function to make my life easier

# Shapiro-Wilk Test
# inspired from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/

def shapiro_wilk(data):
    stat, p = shapiro(data)
    print('Statistics=%.3f, p=%.3f' % (stat, p))

    # interpret
    alpha = 0.05
    if p > alpha:
        print('Sample is normal')
    else:
        print('Sample is not normal')

In [None]:
shapiro_wilk(df_halifax_1995['AvgTemperature'])

The core of the Shapiro-Wilk test is the p-value. We used the code above to calculate the p-value associated with our sample, and then we compare that value to an alpha value, known as the level of significance. This alpha is an arbitrary value and by convention is commonly set to 0.05 (5%). The null hypothesis of the test is that the sample comes from a normal distribution: if p is bigger than 0.05 we find no evidence against normality and treat the sample as normal; otherwise we reject normality. In our sample p=0.000 (rounded to three decimals), therefore the sample is not normal.

Just as a comparison, I will draw a Q-Q plot for a sample that we know comes from a normal distribution.

In [None]:
#as an example, this is a qqplot for a normally distributed sample
my_data = norm.rvs(size=1000)  #I'm using this norm.rvs() function to create a normal sample
sm.qqplot(my_data, line='s')

plt.title('Normally distributed sample', fontsize=20)

pylab.show()

In [None]:
shapiro_wilk(norm.rvs(size=1000)) #Using the function I created above

In that case, p=0.480, which is higher than 0.05; thus the sample is treated as normal, as expected. (The exact p-value changes from run to run because the sample is random.)

### 4.1 The Central Limit Theorem


Speaking of normality and distributions, I would like to take the opportunity to show you the central limit theorem in action. We saw above that the distribution of the temperature values through the year is not normal, but if we perform the following calculation you'll see that the mean of samples of those temperatures follows an approximately normal distribution.

The code below takes a sample of 100 random temperature values and calculates the mean of that sample. Then I repeat the process 3000 times.


In [None]:
values = []
# Select the 1995 temperatures once, outside the loop
temps_1995 = df_halifax.loc[(df_halifax['Year'] == 1995)]['AvgTemperature']
for i in range(3000):
    selected_sample = temps_1995.sample(n=100, replace=True)
    mean_sample = selected_sample.mean()
    values += [mean_sample]

# Summary statistics of the 3000 sample means, computed once after the loop
total_mean = sum(values)/len(values)
SD = np.std(values)
print('Mean: '+str(round(total_mean,2)))
print('Standard deviation: '+str(round(SD,2)))
plt.hist(values)

plt.title('Histogram - Halifax 1995 - Central Limit Theorem', fontsize=15)

plt.show()

The distribution (histogram) of this process is shown above. This is an example of the central limit theorem, which states that the distribution of the sample mean approaches a normal distribution as the sample size grows, even when the original data is not normal. I'll apply the Shapiro-Wilk test and draw the Q-Q plot to check the normality of the distribution.

In [None]:
df_values = pd.DataFrame(values) #this creates a dataframe with all the values of the histogram
my_data = df_values[0] #selecting just the column with the values
sm.qqplot(my_data, line='s')

plt.title('Halifax 1995 - Central Limit Theorem', fontsize=20)

pylab.show()

In [None]:
shapiro_wilk(my_data)
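The central limit theorem also tells us how spread out the sample means should be: their standard deviation is approximately the standard deviation of the original data divided by the square root of the sample size. The sketch below checks this on a synthetic, deliberately non-normal population (`population` is generated with a uniform distribution and is only a stand-in for the real temperatures).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, non-normal stand-in for one year of temperatures
population = rng.uniform(0, 80, size=365)

n = 100           # size of each sample
repeats = 3000    # number of sample means to collect

# Draw samples with replacement and keep the mean of each one
sample_means = np.array([
    rng.choice(population, size=n, replace=True).mean()
    for _ in range(repeats)
])

# Spread predicted by the theorem vs. spread actually observed
predicted_se = population.std() / np.sqrt(n)
observed_sd = sample_means.std()

print(round(predicted_se, 2), round(observed_sd, 2))
print(round(population.mean(), 2), round(sample_means.mean(), 2))
```

The two spreads should be close to each other, and the mean of the sample means should be close to the mean of the original data. The same comparison could be made with the `SD` value printed for Halifax above.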

### 5. Hypothesis test

How to compare the temperature of two cities?

Where could we use the information about the normality of a distribution?

The original idea when I started studying this dataset was to use some statistical tools to select two cities that are similar based on their annual temperature. In general, for this task we could use the t-test (or Student's t-test). However, to compare two cities using the t-test we first need to make sure that the temperature distribution is normal. In our case, we saw that the distribution is not normal. Thus, I need to look for another statistical tool that allows me to compare two different groups. The option is the Kolmogorov-Smirnov test.

Here is an example of how to use the Kolmogorov-Smirnov test.

I'll first compare the temperature of Halifax in two different years, 1995 and 2013.

Before that, I'll see the average temperature for both years.

In [None]:
df_halifax = df[(df.City=='Halifax')]

df_halifax_1995 = df_halifax[(df_halifax.Year==1995)]

df_halifax_2013 = df_halifax[(df_halifax.Year==2013)]

In [None]:
df_halifax_2013['AvgTemperature'].mean()

In [None]:
df_halifax_1995['AvgTemperature'].mean()

We can see that the average temperatures of both years are very close to each other. Now, let's apply the test.

In [None]:
stats.ks_2samp(df_halifax_1995['AvgTemperature'],df_halifax_2013['AvgTemperature']) 

The result gives no evidence that these two groups differ, because the p-value is > 0.05. The idea behind both the t-test and the Kolmogorov-Smirnov test is the hypothesis test, where we have a null hypothesis (often called H0) and an alternative hypothesis (Ha). In my case H0 is "There is no difference between the groups" and Ha is "There is a difference between the groups". Then we run the test and it calculates the p-value. With this result, we decide, based on a threshold (named alpha), whether or not to reject the null hypothesis. A common value for alpha is 0.05. In the case above, where the p-value is 0.82, we fail to reject H0. In other words, we do not reject the hypothesis H0. So, we'll treat those groups of values (the temperatures of 1995 and 2013) as similar. Just a quick note: when we say that we "fail to reject H0", this doesn't mean that we ACCEPT the hypothesis. I think this is a matter for another article and it's outside the scope of the present study.

Let's see a case where the groups are different.

In [None]:
df_beijing = df[(df.City=='Beijing')]
df_beijing_1995 = df_beijing[(df_beijing.Year==1995)]

In [None]:
stats.ks_2samp(df_halifax_1995['AvgTemperature'],df_beijing_1995['AvgTemperature']) 

Here the p-value is close to 0 (3.2 x 10^-12). In that case (p-value < 0.05) we reject H0 and these two groups are considered different.
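To see how the test reacts to a known difference, here is a minimal sketch on synthetic data. The three arrays are generated from normal distributions I chose for illustration (they are not temperatures from the dataset): `a` and `b` share the same distribution, while `c` is shifted by 10 degrees.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic "years" of 365 daily values
a = rng.normal(50, 15, size=365)
b = rng.normal(50, 15, size=365)   # same distribution as a
c = rng.normal(60, 15, size=365)   # same shape, shifted by 10 degrees

same = stats.ks_2samp(a, b)
shifted = stats.ks_2samp(a, c)

alpha = 0.05
for label, result in [("a vs b", same), ("a vs c", shifted)]:
    # The result object exposes the test statistic and the p-value
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(label, round(result.statistic, 3), result.pvalue, decision)
```

The statistic is the largest gap between the two empirical cumulative distributions, so the shifted pair produces a larger statistic and a much smaller p-value than the pair drawn from the same distribution.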

In the next study, we'll use the Kolmogorov-Smirnov test to build a suggestion system.
