# Task 4.1 : Data Exploration

## Task 4.1.1 - Get to know your Data

Before doing any (pre)processing, it is important to get familiar with your data: know what it contains and find the relations within it.

### Data objects and attributes
Python is an object-oriented programming language: objects are Python’s abstraction for data. All data in a Python program is represented by objects or by relations between objects. In this notebook we will use the pandas class **DataFrame** to define our data objects.

A data object represents an entity and can be seen as a group of attributes of that entity.
An attribute is a data field that represents a characteristic, feature or variable of a data object.
Attributes can be subdivided into two types:

**Types of Attributes:**
* **Qualitative attributes:** describe a feature of an object without providing a measurable, numeric value (e.g. the language spoken in a city).

* **Quantitative attributes:** describe a feature of an object with a measurable, numeric value (e.g. the temperature in a city).
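
As a rough illustration of the two attribute types, the sketch below builds a tiny table and separates its columns by dtype. The table, its column names and the temperature values are made up for this example and are not part of the datasets used in this notebook.

```python
import pandas as pd

# Hypothetical two-row table mixing both attribute types
cities = pd.DataFrame({
    "city": ["Bern", "Perth"],                     # qualitative
    "climate": ["Continental", "Mediterranean"],   # qualitative
    "temp": [8.5, 18.7],                           # quantitative (illustrative values)
})

# Numeric columns are treated as quantitative, everything else as qualitative
quantitative = cities.select_dtypes(include="number").columns.tolist()
qualitative = cities.select_dtypes(exclude="number").columns.tolist()

print("Qualitative:", qualitative)
print("Quantitative:", quantitative)
```

Splitting on the dtype is only a heuristic: a numeric code such as a class label is stored as a number but is still a qualitative attribute.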

We will load two datasets, one containing weather data and the other containing details about each city.
The whole process is illustrated in the code below, using weather data downloaded with Meteonorm. Meteonorm is software that provides accurate weather data and representative typical years for any place on Earth, with more than 30 different weather parameters to choose from. The DEMO version allows us to download data for only 5 cities in 2005: Bern, Johannesburg, San Francisco, Perth and Brasilia.

**City Weather Data - Quantitative attributes**

The data is stored in comma-separated files called *city*-hour.dat. 
Here we have selected the following 7 main weather parameters:
- Global radiation ($W/m^2$)
- Diffuse radiation ($W/m^2$)
- Temperature ($°C$)
- Wind speed ($m/s$)
- Relative humidity ($\%$)
- Cloud cover ($oktas$)
- Precipitation ($mm$)

**City Details Data - Qualitative attributes**

We also load a second dataset, with the following attributes:
- City Name 
- Country
- Language
- Climate
- Cost of living
- Main Sport
- Florestation
- Hemesphire


**NOTE 1** In meteorology, an okta is a unit of measurement used to describe the amount of cloud cover. 
SKC = Sky clear (0 oktas); FEW = Few (1 to 2 oktas); SCT = Scattered (3 to 4 oktas); BKN = Broken (5 to 7 oktas); OVC = Overcast (8 oktas)
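
The okta ranges in NOTE 1 can be sketched as a small lookup function. The function name `okta_to_code` is an assumption made for this example, and treating non-integer values as belonging to the next code up is a choice of this sketch, not something the note specifies.

```python
def okta_to_code(oktas):
    """Map a cloud cover value in oktas (0 to 8) to its sky-condition code."""
    if not 0 <= oktas <= 8:
        raise ValueError("cloud cover must be between 0 and 8 oktas")
    if oktas == 0:
        return "SKC"   # sky clear
    if oktas <= 2:
        return "FEW"   # few
    if oktas <= 4:
        return "SCT"   # scattered
    if oktas <= 7:
        return "BKN"   # broken
    return "OVC"       # overcast

print([okta_to_code(o) for o in range(9)])
```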

**NOTE 2** The city details data was compiled from online sources; the information may be out of date.


In [None]:
#@title Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#Create City Details data
City_Name = ['Bern','San Francisco','Johannesburg','Perth', 'Brasilia' ]
Country = ['Switzerland','United States of America', 'South Africa','Australia', 'Brasil' ]
Language = ['German','English', 'English','English', 'Portuguese' ]
Climate = ['Continental', 'Mediterranean','SubTropical Highland','Mediterranean','Tropical Savanna']
Cost_of_living = ['$$$', '$$$','$','$$','$']
Main_Sport = ['Soccer','Basketball', 'Football','Football', 'Soccer' ]
Florestation = ['Pines', 'Oak','Baobab','Eucalypt','Pernambuco']
Hemesphire = ['North','North','South', 'South', 'South']


columns = ['City Name', 'Country', 'Language', 'Climate', 'Main Sport', 'Cost of living', 'Florestation', 'Hemesphire']
d = {'City Name': City_Name, 'Country': Country, 'Language': Language, 'Climate':Climate, 'Main Sport': Main_Sport, 'Cost of living':Cost_of_living, 'Florestation':Florestation, 'Hemesphire':Hemesphire }

CityDataset = pd.DataFrame(data=d)
CityDataset.to_csv("./weather-data/CityDataset.dat", index=False, header = True, sep = ",")

`create_dataframe()` is a function that assigns names to the imported dataset's columns and creates a DatetimeIndex for easier manipulation of the time-series data. The function also creates an additional **class** column to distinguish each city.
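
To make the index construction concrete, here is a minimal sketch of the same idea on a tiny synthetic table. It is an alternative formulation using `pd.to_datetime`, not the helper used in this notebook, and the rows and temperature values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows in a year/month/day/hour layout, with hours running 1..24
raw = pd.DataFrame({
    "year": [2005, 2005],
    "month": [1, 1],
    "day": [1, 1],
    "hour": [1, 24],
    "temp": [-1.2, 0.4],   # made-up values
})

parts = raw[["year", "month", "day", "hour"]].copy()
parts["hour"] = parts["hour"] - 1   # shift 1..24 to the 0..23 range datetimes expect

# to_datetime assembles timestamps from columns named year, month, day, hour
raw.index = pd.DatetimeIndex(pd.to_datetime(parts), name="DateTime")
raw = raw.drop(columns=["year", "month", "day", "hour"])

print(raw)
```

With a DatetimeIndex in place, time-based operations such as `resample('D')` become available on the DataFrame.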

In [None]:
#@title Helper function
def create_dataframe(dataframe, cls):
    
    import datetime as dt
    
    dataframe.columns = ['year', 'month', 'day', 'hour', 'global radiation', 'diffuse radiation', 
                         'temp', 'wind speed','relative humidity', 'cloud cover', 'precipitation']
    
    # work on a copy so that the edits below do not modify a view of dataframe
    datetime = dataframe.loc[:, 'year':'hour'].copy()
    
    # the original data has hours in the 1 --> 24 format, but datetime accepts only 0 --> 23
    datetime['hour'] = datetime['hour'] - 1
    
    datetime['DateTime'] = datetime.apply(lambda row: dt.datetime(row.year, row.month, row.day, row.hour), axis=1)
    datetime['DateTime'] = pd.to_datetime(datetime.DateTime)
    dataframe.index = pd.DatetimeIndex(datetime.DateTime)
    
    # include the class column for each city
    dataframe['class'] = np.full(shape=dataframe.shape[0], fill_value=cls)
    
    # delete the first four columns, they are not needed now that there is a DateTimeIndex
    dataframe = dataframe.drop(['year', 'month', 'day', 'hour'], axis=1)
    
    return dataframe

In [None]:
#@title Load Weather data
df1 = pd.read_table('./weather-data/Bern-hour.dat', sep=',', header=None)
df2 = pd.read_table('./weather-data/Johannesburg-hour.dat', sep=',', header=None)
df3 = pd.read_table('./weather-data/SanFrancisco-hour.dat', sep=',', header=None)
df4 = pd.read_table('./weather-data/Perth-hour.dat', sep=',', header=None)
df5 = pd.read_table('./weather-data/Brasilia-hour.dat', sep=',', header=None)

bern_weather = create_dataframe(df1, cls=0)
perth_weather = create_dataframe(df4, cls=1)
johannesburg_weather = create_dataframe(df2, cls=2)
sanfrancisco_weather = create_dataframe(df3, cls=3)
brasilia_weather = create_dataframe(df5, cls=4)

# Load the city details data and, for simplicity, keep one single-row DataFrame per city
CityDataset = pd.read_table('./weather-data/CityDataset.dat', sep=',')
bern_details = CityDataset.loc[CityDataset['City Name'] == 'Bern' ]
sanfrancisco_details = CityDataset.loc[CityDataset['City Name'] == 'San Francisco' ]
johannesburg_details = CityDataset.loc[CityDataset['City Name'] == 'Johannesburg' ]
perth_details = CityDataset.loc[CityDataset['City Name'] == 'Perth' ]
brasilia_details = CityDataset.loc[CityDataset['City Name'] == 'Brasilia' ]



For each city, we have two data objects, **`<city>_weather`** and **`<city>_details`**, containing the city's weather and the city's details respectively.

Display the data and get familiar with it.

In [None]:
#Visualize our data
print("Bern details\n",bern_details )

print("\nBern weather\n",bern_weather )

#Visualize data using dtypes
print("\nBern city details dtypes")
print (bern_details.dtypes)

print("\nBern city weather dtypes")
print(bern_weather.dtypes)


## Task 4.1.2 -  Similarity
Similarity is a measure of how alike two data objects are. It is a very important concept in data exploration and a basic building block of machine learning techniques such as clustering and classification.

However, similarity is subjective and highly dependent on the domain and application. For example, one could say that two cities like Perth and San Francisco are similar because both speak English and both have a Mediterranean climate.

This represents a simple way to view similarity:

    Similarity = 1 if X.attribute = Y.attribute        (Where X, Y are two objects)
    Similarity = 0 if X.attribute ≠ Y.attribute
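
The rule above can be sketched on plain dictionaries. The function name `attribute_similarity` and the dictionary representation are assumptions made for this illustration; the exercise that follows works on DataFrame rows instead.

```python
def attribute_similarity(x, y, attribute):
    """Return 1 if both records share the same value for `attribute`, else 0."""
    return 1 if x[attribute] == y[attribute] else 0

# A few attribute values taken from the city details table
perth = {"Language": "English", "Climate": "Mediterranean", "Hemesphire": "South"}
sanfrancisco = {"Language": "English", "Climate": "Mediterranean", "Hemesphire": "North"}

for attribute in perth:
    print(attribute, attribute_similarity(perth, sanfrancisco, attribute))
```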

Implement a function called **equal_attribute(x, y, attribute)** that returns True if two cities have the same value for the given attribute.

In [None]:
# TO DO
def equal_attribute(x, y, attribute):
    raise NotImplementedError("TO DO: implement equal_attribute")

In [None]:
# Do Perth and San Francisco have the same language?
equal_attribute(perth_details,sanfrancisco_details,'Language')


Expected output : True

Suppose we define the similarity level as the number of attributes that have equal values in two cities. This is a definition based on the qualitative attributes of the city details data.
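
As a rough sketch of this counting idea, the snippet below compares two cities stored as pandas Series, using the attribute values from the city details table. The Series representation and the variable names are assumptions made for the illustration.

```python
import pandas as pd

perth = pd.Series({
    "Country": "Australia", "Language": "English", "Climate": "Mediterranean",
    "Main Sport": "Football", "Cost of living": "$$",
    "Florestation": "Eucalypt", "Hemesphire": "South",
})
sanfrancisco = pd.Series({
    "Country": "United States of America", "Language": "English",
    "Climate": "Mediterranean", "Main Sport": "Basketball",
    "Cost of living": "$$$", "Florestation": "Oak", "Hemesphire": "North",
})

matches = perth == sanfrancisco        # element-wise comparison, one boolean per attribute
similarity_level = int(matches.sum())  # True counts as 1

print("Shared attributes:", matches[matches].index.tolist())
print("Similarity level:", similarity_level)
```

Here the two cities share their language and climate, so the level is 2 out of 7 attributes.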

Implement a function **get_city_similarity_level(x, y)** that receives the details of two cities as inputs and returns the level of similarity between them.

In [None]:
## TO DO
def get_city_similarity_level(x, y):
    raise NotImplementedError("TO DO: implement get_city_similarity_level")


Using the previous function, display the level of similarity of Johannesburg to the other cities.

In [None]:
print ("Johannesburg Similarity Level to:")

print("Bern:",get_city_similarity_level(johannesburg_details, bern_details) )
print("San Francisco:",get_city_similarity_level(johannesburg_details, sanfrancisco_details) )
print("Perth:",get_city_similarity_level(johannesburg_details, perth_details) )
print("Brasilia:",get_city_similarity_level(johannesburg_details, brasilia_details) )


## Task 4.1.3 - Proximity Measures
In the previous task, we defined a very simple similarity measure based on qualitative attributes.

However, in most cases our attributes are quantitative, as is the case for our weather data.
Similarity in this context is usually described as a distance, with dimensions representing features of the objects. If this distance is small, there is a high degree of similarity; if it is large, there is a low degree of similarity.

The most common example is the Euclidean distance:

**Euclidean distance**

$ d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{id} - x_{jd})^2}$

where $d(i,j)$ is the distance between samples $i$ and $j$, and $x_{ik}$ is the value of attribute $k$ (out of $d$ attributes) for sample $i$.
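
One possible NumPy sketch of this formula is shown below. The name `euclidean_np` is chosen for this example only, and the inputs are assumed to be equal-length numeric sequences.

```python
import numpy as np

def euclidean_np(x, y):
    """Square root of the sum of squared element-wise differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    return float(np.sqrt(np.sum((x - y) ** 2)))

# The points (0, 0) and (3, 4) form a 3-4-5 right triangle
print(euclidean_np([0, 0], [3, 4]))
```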

Implement a function **euclidian_distance(x, y)** that returns the Euclidean distance between two data series.

In [None]:
#TO DO
# Implement the Euclidean distance function
from math import sqrt
def euclidian_distance(x, y):
    raise NotImplementedError("TO DO: implement euclidian_distance")


In [None]:
#Check function
euclidian_distance([0,1,2,0],[1,4,2,5]) 


Expected output: 5.916079783099616

Display the Euclidean distance between the temperature in Johannesburg and in the other cities.


In [None]:
print ("Johannesburg Temperature euclidian distance to:")
print ("Perth:", euclidian_distance(johannesburg_weather.temp,perth_weather.temp) )
print ("Bern:",euclidian_distance(johannesburg_weather.temp,bern_weather.temp) )
print ("San Francisco:",euclidian_distance(johannesburg_weather.temp,sanfrancisco_weather.temp) )
print ("Brasilia:", euclidian_distance(johannesburg_weather.temp,brasilia_weather.temp) )

We see that the city closest in temperature to Johannesburg is Perth, and the farthest is Bern.
This means that, of the four cities, Perth has the temperature most similar to Johannesburg's.

### Plot the Temperature
We can confirm our proximity measure by plotting the temperature of the cities.

As expected, the temperature in Perth over the year is considerably more similar to Johannesburg's than the temperature in Bern is.


In [None]:
# Plot and compare the temperature of Johannesburg with that of Perth and of Bern
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15, 5))
ax1.set_title('Perth & Johannesburg - Similar Temperatures')
ax1.plot(johannesburg_weather.temp, label="Johannesburg")
ax1.plot(perth_weather.temp, label="Perth")
ax1.set_ylabel(r'Temperature ($^\circ$C)')
ax1.legend()

ax2.set_title('Bern & Johannesburg - Not Similar Temperatures')
ax2.plot(johannesburg_weather.temp, label="Johannesburg")
ax2.plot(bern_weather.temp, label="Bern")
ax2.set_ylabel(r'Temperature ($^\circ$C)')
ax2.legend()

plt.show()


Now we want to define a similarity metric between the weather in two cities. In this case we want the similarity to cover not only the temperature but also the precipitation, wind speed, etc.: all the variables that constitute the weather.

Write a **get_weather_similarty(city1, city2)** function that returns the weather similarity between two cities.

In [None]:
#TO DO
def get_weather_similarty(city1, city2):
    raise NotImplementedError("TO DO: implement get_weather_similarty")

Use your new function to display the weather similarity of Perth to the other cities.

Which city has the most similar weather, and which the most different? Does our weather similarity match the climate of the cities?

In [None]:
print ("Perth Weather Similarity to:")
print ("Bern", get_weather_similarty(perth_weather,bern_weather) )
print ("San Francisco", get_weather_similarty(perth_weather,sanfrancisco_weather) )
print ("Johannesburg", get_weather_similarty(perth_weather,johannesburg_weather) )
print ("Brasilia", get_weather_similarty(perth_weather,brasilia_weather) )


CityDataset[['City Name','Climate']]

Care should be taken when calculating a distance across dimensions/features that are unrelated or measured in different units. The values of each feature should be normalized first; otherwise one feature could end up dominating the distance.
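
The effect of normalization can be sketched with two small made-up tables. All values below are invented for illustration, and z-score standardization over the pooled data is just one possible normalization, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up values for two hypothetical cities, on very different scales
a = pd.DataFrame({"temp": [10.0, 12.0, 14.0],
                  "global radiation": [100.0, 400.0, 700.0]})
b = pd.DataFrame({"temp": [20.0, 22.0, 24.0],
                  "global radiation": [150.0, 450.0, 750.0]})

def column_distances(u, v):
    """Euclidean distance between u and v, computed separately for each column."""
    return np.sqrt(((u - v) ** 2).sum())

# Standardize each column with the mean and standard deviation of the pooled data
pooled = pd.concat([a, b])
mean, std = pooled.mean(), pooled.std()
a_z, b_z = (a - mean) / std, (b - mean) / std

raw = column_distances(a, b)
scaled = column_distances(a_z, b_z)
print("Raw distances per feature:\n", raw)
print("Standardized distances per feature:\n", scaled)
```

On the raw values the radiation column dominates simply because its numbers are larger. After standardization the 10 °C temperature gap, which is large relative to the spread of the temperatures, contributes the most.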

### Other Proximity Measures

Besides the Euclidean distance, there are many other similarity and distance measures, such as:

**Cosine Similarity:**

It is defined as the cosine of the angle between two vectors.
Cosine similarity is useful because two data series that are far apart in terms of Euclidean distance may still point in a similar direction. The smaller the angle, the higher the cosine similarity.

$ \cos \theta = \frac{u \cdot v}{\|u\| \, \|v\|}$,

where $u$ and $v$ are two vectors in a multi-dimensional space.
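
A minimal NumPy sketch of this definition follows. The function name `cosine_similarity` and the example vectors are assumptions made for the illustration, and the vectors are assumed to be non-zero.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = [1, 2, 3]
v = [10, 20, 30]   # same direction as u, ten times longer
w = [-2, 1, 0]     # perpendicular to u (dot product is 0)

print("cos(u, v):", cosine_similarity(u, v))
print("cos(u, w):", cosine_similarity(u, w))
print("Euclidean distance between u and v:", np.linalg.norm(np.subtract(u, v)))
```

The vectors `u` and `v` are far apart in Euclidean terms, yet their cosine similarity is (up to rounding) 1 because they point in the same direction.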

**Manhattan distance:**

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates.

$ d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{id} - x_{jd}| $
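
The Manhattan and Euclidean distances are both special cases of the Minkowski distance, with exponent $p=1$ and $p=2$ respectively. The sketch below illustrates this; the function name `minkowski_distance` is an assumption of this example and is not the SciPy function.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """(sum of |x_k - y_k|^p) ** (1/p); p=1 gives Manhattan, p=2 gives Euclidean."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = [3, 4, 5, 6]
y = [4, 5, 6, 7]

manhattan = minkowski_distance(x, y, p=1)   # 1 + 1 + 1 + 1
euclid = minkowski_distance(x, y, p=2)      # sqrt(1 + 1 + 1 + 1)
print("Manhattan:", manhattan)
print("Euclidean:", euclid)
```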


You can use the **scipy.spatial.distance** module to import some of these measures.


In [None]:
x = [3,4,5,6]
y = [4,5,6,7]

#Cosine distance (SciPy returns 1 - cosine similarity)
from scipy.spatial.distance import cosine
print("Cosine distance", cosine(x,y) )

#Manhattan distance
from scipy.spatial.distance import cityblock
print("Manhattan distance", cityblock(x,y) )

#Euclidian distance
from scipy.spatial.distance import euclidean
print("Euclidian distance", euclidean(x,y) )

#Minkowski distance - generalisation of the Euclidean and Manhattan distances (default p=2, i.e. Euclidean)
from scipy.spatial.distance import minkowski
print("Minkowski distance", minkowski(x,y) )

## Task 4.1.4  - Data Relationships 

Besides similarity, there are many other relationships in our data. Finding these relationships is a cornerstone of understanding many concepts and methods in pattern recognition and statistics.

Variables within a dataset can be related for lots of reasons, for example:

- One variable could cause or depend on the values of another variable.

- One variable could be lightly associated with another variable.

- Two variables could depend on a third unknown variable.

To study the statistical relationship between two variables, we define covariance and correlation.

### Covariance

Covariance is a measure of how much two random variables vary together (e.g. the temperature and precipitation in a city). In the covariance matrix, the off-diagonal elements contain the covariances of each pair of variables, and the diagonal elements contain the variances of each variable.

    If COV(xi, xj) = 0 then the variables are uncorrelated.
    If COV(xi, xj) > 0 then the variables are positively correlated.
    If COV(xi, xj) < 0 then the variables are negatively correlated.

If the covariance is positive, the variables tend to grow together, while a negative covariance means they move inversely. A covariance of zero means the variables have no linear relationship; it does not by itself imply that they are independent.
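
The layout of the covariance matrix can be illustrated with a small sketch on made-up arrays. Note that `np.cov` computes the sample covariance (dividing by $n-1$), so the diagonal is compared against `np.var(..., ddof=1)`. The second pair of arrays shows that a zero covariance only rules out a linear relationship.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x                          # grows together with x

c = np.cov(x, y)                     # 2x2 covariance matrix
print(c)
print("var(x):", c[0, 0], " var(y):", c[1, 1], " cov(x, y):", c[0, 1])

# t is completely determined by s, yet their covariance is zero
s = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
t = s ** 2
print("cov(s, t):", np.cov(s, t)[0, 1])
```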


### Variance
Variance measures the variation of a single random variable (e.g. the temperature in a city): how much the data are scattered about the mean.
The variance is equal to the square of the standard deviation.

### Correlation

While covariance indicates the direction of the linear relationship between variables, correlation measures both the strength and the direction of the linear relationship between two variables. Correlation is a function of covariance: it is the covariance divided by the product of the two standard deviations.
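
The relation between the two quantities can be sketched as follows: dividing the covariance by the product of the standard deviations gives the (Pearson) correlation coefficient, which is cross-checked here against `np.corrcoef`. The arrays are made-up values for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

cov_xy = np.cov(x, y)[0, 1]                           # sample covariance (ddof=1)
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # same ddof as the covariance

print("correlation from covariance:", r)
print("np.corrcoef:                ", np.corrcoef(x, y)[0, 1])
```

Unlike the covariance, the result is unit-free and always lies between -1 and 1, which makes values comparable across pairs of variables.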


Let's start by plotting the precipitation in the city of Brasilia with respect to the cloud cover.

In [None]:
#plot precipitation for cloud cover
x = brasilia_weather['cloud cover']
y = brasilia_weather['precipitation']
plt.scatter(x,y)
plt.title('Brasilia')
plt.ylabel('Precipitation ($mm$)')
plt.xlabel('Cloud cover ($oktas$)')


As expected, the data are highly correlated. Rain and clouds are natural phenomena that, generally speaking, occur together.

Use the NumPy function **np.cov(x, y)** , to determine the covariance between precipitation and cloud cover in the city of Brasilia

In [None]:
#TO DO 


Display the calculated covariance in a heatmap (use **seaborn.heatmap**).

In [None]:
#TO DO
import seaborn as sn


What about temperature and wind speed? Is there a correlation between the two in a city like Brasilia?

Plot the scatter graph and calculate the covariance between temperature and wind speed in the city of Brasilia.

In [None]:
#TO DO


In [None]:
#TO DO 


### Pearson's Correlation

The Pearson correlation coefficient can be used to summarize the strength of the linear relationship between two data samples.

### Spearman’s Correlation
Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables.

Further, the two variables being considered may have a non-Gaussian distribution. In this case, Spearman’s correlation coefficient (named for Charles Spearman) can be used to summarize the strength of the relationship between the two data samples. It can also be used if there is a linear relationship between the variables, but it will have slightly less power (e.g. it may result in lower coefficient scores).
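
The difference between the two coefficients shows up on a monotonic but nonlinear relationship. The sketch below uses synthetic data ($y = x^3$), chosen purely for illustration: Spearman's coefficient works on ranks and reaches 1, while Pearson's stays below 1 because the relationship is not a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1, 11, dtype=float)
y = x ** 3                      # monotonic, but not linear

r_pearson, _ = pearsonr(x, y)   # second value is the p-value
r_spearman, _ = spearmanr(x, y)

print("Pearson: ", r_pearson)
print("Spearman:", r_spearman)
```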


Use **pearsonr** and **spearmanr** from **scipy.stats** to calculate the correlation between two variables of your choice.
Interpret the values and plot the variables to confirm your results.

In [None]:
from scipy.stats import pearsonr

#TO DO


In [None]:
from scipy.stats import spearmanr

#TO DO


## Task 4.1.5 - Hypothesis Testing

A common problem in applied machine learning is determining whether input features are relevant to the outcome to be predicted. This is the problem of feature selection. We saw in the previous task that covariance and correlation can be used to learn relations between quantitative data.

However, suppose that instead of temperature measurements we only have a description of whether the day was cold or hot. This is called a categorical variable: a variable that can take on one of a limited, and usually fixed, number of possible values.

In classification problems where the input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent on or independent of the input variables. If it is independent, the input variable is a candidate for a feature that may be irrelevant to the problem and can be removed from the dataset.

## Task 4.1.5.1 - Pearson’s Chi-squared
Pearson’s chi-squared test is an example of a statistical hypothesis test for independence between categorical variables.



### Categorical Data

In this task, instead of having our precise weather measurements, we only have the observations of an ordinary person.
This person only distinguished the days as hot or cold and, for a full year, counted the number of days that were rainy, cloudy or sunny.

This results in a table of **qualitative attributes**, more precisely **nominal attributes**.

Run the following code and display the table CaliTable with the days counted.

In [None]:
#@title Helper function
#Counting the hot/cold days and if it rains or not

def count_weather_days(city_weather):
    
    #Downsample by max value in a day
    city_weather_day = city_weather.resample('D').max()

    rainy_days = len(city_weather_day[(city_weather_day['precipitation']> 2 )])   
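    # cloudy: some precipitation, up to the rainy threshold (> 2), so that every day falls in exactly one category
    cloudy_days = len(city_weather_day[(city_weather_day['precipitation']> 0.0001) & (city_weather_day['precipitation']<= 2)])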
    sunny_days = len(city_weather_day[(city_weather_day['precipitation']<= 0.0001)])   
    
    hot_days = len(city_weather_day[(city_weather_day['temp']>17)])   
    cold_days = len(city_weather_day[(city_weather_day['temp']<= 17)])  
    
    hot_sunny_days = len(city_weather_day[(city_weather_day['temp']>17) & 
                                                 (city_weather_day['precipitation']<= 0.0001) ])   
    cold_sunny_days = len(city_weather_day[(city_weather_day['temp'] <= 17) & 
                                                 (city_weather_day['precipitation']<= 0.0001) ])
    
    hot_rainy_days = len(city_weather_day[(city_weather_day['temp']>17) & 
                                                 (city_weather_day['precipitation']> 2 ) ]) 
    cold_rainy_days = len(city_weather_day[(city_weather_day['temp']<= 17) & 
                                                 (city_weather_day['precipitation']> 2 ) ])

    hot_cloudy_days = hot_days - hot_rainy_days - hot_sunny_days
    cold_cloudy_days = cold_days - cold_rainy_days -cold_sunny_days
    
    Sunny = [hot_sunny_days,cold_sunny_days]
    Cloudy = [hot_cloudy_days,cold_cloudy_days]
    Rainy = [hot_rainy_days,cold_rainy_days ]
    Total =[hot_days, cold_days]

    columns = ['Hot Days','Cold days']
    index = ['Hot Days','Cold days']
    d = {'Sunny days': Sunny, 'Cloudy days': Cloudy, 'Rainy days': Rainy }
    
    Table_Simple =pd.DataFrame(data=d,index=index,columns=None)
    
    Sunny = [hot_sunny_days,cold_sunny_days, sunny_days]
    Cloudy = [hot_cloudy_days,cold_cloudy_days, cloudy_days]
    Rainy = [hot_rainy_days,cold_rainy_days, rainy_days ]
    Total =[hot_days, cold_days, (hot_days+ cold_days)]

    columns = ['Hot Days','Cold days', 'Total']
    index = ['Hot Days','Cold days', 'Total']
    d = {'Sunny days': Sunny, 'Cloudy days': Cloudy, 'Rainy days': Rainy, 'Total': Total }
    Table_with_Total = pd.DataFrame(data=d,index=index,columns=None)
    
    return (Table_Simple, Table_with_Total )

In [None]:
CaliTable, CaliTable_with_Total = count_weather_days(sanfrancisco_weather)

CaliTable


|           	| Sunny days 	| Cloudy days 	| Rainy days 	| Total 	|
|:---------:	|:----------:	|:-----------:	|:----------:	|:-----:	|
|  Hot Days 	|     205    	|      7      	|      4     	|  216  	|
| Cold days 	|     95     	|      25     	|     29     	|  149  	|
|   Total   	|     300    	|      32     	|     33     	|  365  	|


Nice and warm San Francisco, with 300 sunny days and only 149 cold days a year.

A table summarizing two categorical variables in this form is called a contingency table.
It is called a contingency table because the intent is to help determine whether one variable is contingent upon, or depends upon, the other variable.
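
When the raw categorical observations are available, a contingency table can also be built directly with `pd.crosstab`. The sketch below uses six made-up daily observations; the column names `temperature` and `sky` are assumptions of this example.

```python
import pandas as pd

# Made-up daily observations, one row per day
days = pd.DataFrame({
    "temperature": ["hot", "hot", "cold", "cold", "hot", "cold"],
    "sky": ["sunny", "sunny", "rainy", "cloudy", "sunny", "rainy"],
})

# Count how often each combination of categories occurs
table = pd.crosstab(days["temperature"], days["sky"])

# The same table with row and column totals added
with_totals = pd.crosstab(days["temperature"], days["sky"],
                          margins=True, margins_name="Total")

print(table)
print(with_totals)
```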

For example, does a hot day (temperature) depend on good weather (rain/sun), or are they independent?
Can you answer that from the table alone? To confirm, we will use Pearson’s chi-squared test.

Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. 


In this example, the numbers of observations in the two categories (hot and cold) are not the same. Nevertheless, we can calculate the expected frequency of each cell under the assumption that the two variables are independent, and compare it with the observed frequency.

In the standard applications of this test, the observations are classified into mutually exclusive classes. The result of the test can be interpreted to reject or fail to reject the assumption, or null hypothesis, that the observed and expected frequencies are the same.

If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a chi-squared ($\chi^2$) distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true.
The variables are considered independent if the observed and expected frequencies are similar.
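
To show where the expected frequencies come from, the sketch below computes them by hand for the observed counts in the table above: under independence, each expected cell count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The result is cross-checked against SciPy's `chi2_contingency`; the variable names are assumptions of this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = hot/cold days, columns = sunny/cloudy/rainy days
observed = np.array([[205, 7, 4],
                     [95, 25, 29]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
n = observed.sum()

expected = row_totals * col_totals / n             # expected counts under independence
statistic = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

stat_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)
print("manual statistic:", statistic, " SciPy statistic:", stat_scipy)
print("degrees of freedom:", dof)
```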


We will use the SciPy function **chi2_contingency()**, which takes as input an array representing the contingency table for the two categorical variables and returns the calculated statistic and p-value for interpretation, as well as the calculated degrees of freedom and the table of expected frequencies.

In [None]:
from scipy.stats import chi2_contingency
from scipy.stats import chi2

stat, p, dof, expected = chi2_contingency(CaliTable)
print("statistic",stat)
print("p-value",p)
print("degrees of freedom: ",dof)
print("table of expected frequencies\n",expected)


### Interpret the test statistic

We can interpret the test statistic by comparing it with a critical value of the chi-squared distribution for the calculated degrees of freedom. For example, we can use a probability of 90%, meaning that under the null hypothesis (the variables are independent) the statistic stays below the critical value 90% of the time. If the statistic is less than the critical value, we fail to reject this assumption; otherwise it is rejected.

    If statistic >= critical value: significant result, reject null hypothesis (H0), dependent.
    If statistic < critical value: not significant result, fail to reject null hypothesis (H0), independent.


In [None]:
prob = 0.90
critical = chi2.ppf(prob, dof)
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

We can also interpret the p-value by comparing it with a chosen significance level, which here is 10%, obtained as one minus the probability used in the critical value interpretation.

In [None]:
# interpret p-value
alpha = 1.0 - prob
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

The  Chi-Square Test confirms that the variables are dependent.
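
The two interpretations always lead to the same decision, because the p-value is the probability of a chi-squared value at least as large as the statistic. The sketch below illustrates this with a hypothetical statistic and degrees of freedom; these numbers are not the result of the test above.

```python
from scipy.stats import chi2

stat, dof = 5.0, 2               # hypothetical statistic and degrees of freedom
prob = 0.90
alpha = 1.0 - prob

critical = chi2.ppf(prob, dof)   # value exceeded with probability alpha under H0
p_value = chi2.sf(stat, dof)     # P(chi-squared >= stat) under H0

reject_by_statistic = stat >= critical
reject_by_p_value = p_value <= alpha

print("critical value:", critical, " p-value:", p_value)
print("reject H0 (statistic):", reject_by_statistic)
print("reject H0 (p-value):  ", reject_by_p_value)
```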