# Data Science Capstone project

This notebook contains the IBM capstone project. It currently covers only week 1, but it will be expanded to cover both weeks 1 and 2. The week 1 part contains the introduction, plan and data description of the project, whereas the week 2 part will contain the actual implementation.

# Week one - introduction and plan


Our capstone project studies the distribution of relative education levels in Finland. In particular, we wish to see whether a clustering of areas based on education level correlates with the number of coffee shops found in each area.


### A description of the problem and a discussion of the background.

Finland is a country of both [relatively high education levels](https://en.wikipedia.org/wiki/Education_in_Finland) and [relatively large coffee consumption](https://www.caffeineinformer.com/caffeine-what-the-world-drinks). In this project we wish to see whether a nuanced educational profile of various Finnish areas correlates with the estimated number of coffee shops in each area.

Our motivation comes from the observation that mathematicians seem to consume very large amounts of coffee, even by Finnish standards, and we are interested to see whether there is a more quantifiable connection between levels of education and coffee consumption.

In particular, we wish to cluster Finnish postal code areas based on their "educational profile", by which we mean a vector that gives the relative numbers of adults in the given region who have passed certain levels in the Finnish education system. Such data could be used, in theory, by a coffee chain to decide how to distribute its coffee shops across Finnish neighbourhoods.

### A description of the data and how it will be used to solve the problem. 


We use the [database services provided by the Finnish national center of statistics](https://www.stat.fi/org/avoindata/pxweb.html). The data set we acquired is indexed by postal code area and contains the following fields:
1. The location of the postal code area (rough center point)
2. The area of the postal code area.
3. The total adult population of the postal code area
4. The number of all people who have been educated in the area
5. The number of people who have finished basic education
6. The number of people who have passed high school
7. The number of people who have passed trade school
8. The number of bachelor diplomas
9. The number of master's diplomas
10. The amount of professional, scientific and technical action. (I'm not quite sure how this is actually measured, probably in relation to companies, but I took it as an extra clustering parameter.)
11. The student population of the area.
    
    
Fields 1 and 2 are used for locational purposes, fields 2 and 3 for scaling the data, and fields 5-11 for clustering.
We study two cases: fields 5-11 are first divided either by the area of the postal code area or by its adult population. The resulting columns are then normalized to make them more comparable, and used to produce k-means clusterings of the postal code areas for values of k between 2 and 10. The clusterings will be visualized with Folium.
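The steps just described can be sketched on a small made-up table. This is only an illustration under stated assumptions: the frame `toy`, its values and its reduced set of columns are hypothetical, and the loop stops at k = 5 because six toy rows cannot support ten clusters.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up counts for six hypothetical postal code areas (not real data).
toy = pd.DataFrame({
    "adult_population": [1000, 1200, 800, 5000, 4500, 300],
    "basic_ed": [400, 500, 300, 600, 500, 150],
    "bachelor_ed": [100, 120, 90, 1500, 1400, 20],
    "master_ed": [50, 60, 40, 1800, 1700, 5],
})

# Scale each education count by the adult population to get shares.
shares = toy.drop(columns="adult_population").div(toy["adult_population"], axis=0)

# Standardize each column (z-score) so that no single field dominates the distances.
standardized = (shares - shares.mean()) / shares.std()

# Fit k-means for several k and record the within-cluster sum of squares.
inertias = {}
for k in range(2, 6):  # the plan uses k = 2..10; the toy data only allows fewer
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(standardized)
    inertias[k] = model.inertia_

print(inertias)
```

Comparing the recorded inertia values across k is one common way to judge how many clusters are worth keeping.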

Finally, we will use Foursquare to estimate the number of coffee shops in each postal code area and check whether this number correlates with the education profile of the area.
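One simple way to check such a correlation is the Pearson coefficient between a single education share and the coffee shop count. The sketch below uses invented numbers purely to show the mechanics; `master_share` and `coffee_shops` are hypothetical arrays, not results from this project.

```python
import numpy as np

# Invented values for five hypothetical postal code areas (not real data).
master_share = np.array([0.05, 0.10, 0.20, 0.35, 0.40])  # share of adults with a master's degree
coffee_shops = np.array([1, 2, 2, 6, 7])                 # estimated number of coffee shops

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is Pearson's r.
r = np.corrcoef(master_share, coffee_shops)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 or -1 would suggest a strong linear relationship, but with real data it would still say nothing about causation.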


# Week two - implementation and report

### The report (To be moved further down)


#### Introduction
Introduction where you discuss the business problem and who would be interested in this project.
#### Data
Data where you describe the data that will be used to solve the problem and the source of the data.
#### Methodology
Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why.
#### Results
Results section where you discuss the results.
#### Discussion
Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
#### Conclusion
Conclusion section where you conclude the report.

### The implementation

In [3]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.cluster import KMeans

!conda install -c conda-forge utm
# needed for coordinate changes
import utm


!conda install -c conda-forge folium=0.5.0 --yes
import folium 

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - utm


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    certifi-2020.4.5.1         |   py36h9f0ad1d_0         151 KB  conda-forge
    utm-0.5.0                  |             py_0           8 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.4 MB

The following NEW packages will be INSTALLED:

    python_abi:      3.6-1_cp36m       conda-forge
    utm:             0.5.0-py_0        conda-forge

The following packages will be UPDATED:

    ca-ce

In [4]:
# Here we import the data
data_df = pd.read_csv('https://luisto.fi/documents/KoulutusasteetAE.csv', sep=';')

In [5]:
# Here we translate the data and wrangle it a bit.
work_data = data_df.copy()

work_data.iloc[0,0] = "Whole Country"

new_column_names = [
    'postal_codes_full', # PC borough (region)
    'X', # X-coordinate in meters
    'Y', # Y-coordinate in meters
    'postal_code_area', # Land area of postal code area
    'adult_population', # total adult population in postal code
    'basic_ed', # number of people who have finished basic education
    'educated_total', # number of all people who have been educated
    'higher_ed', # number of people who have passed high school
    'professional_ed', # number of people who have passed trade school
    'bachelor_ed', # number of bachelor diplomas
    'master_ed', # number of master's diplomas
    'prof_sci_tech_action', # amount of professional, scientific and technical action (?)
    'student_population' # student population
]
work_data.columns = new_column_names

work_data.replace("...","0", inplace=True) # There are a few missing values which we'll just assume to be 0
work_data.postal_code_area = work_data.postal_code_area/1000000 # We divide the area by 10^6 (m^2 to km^2, judging by the values) to prevent underflows later on



work_data = work_data.loc[work_data.X > 100000 ] # we remove a few data points (less than 1%) with bad coordinate values



In [6]:
location_data = work_data[['X', 'Y']].astype(float)
print(location_data.dtypes)

# A function that changes the X- and Y-coordinate pair, given in UTM, to latitude and longitude.
def utm_coord_to_latlon(row):
    x_coord = row.X
    y_coord = row.Y
    latitude, longitude = utm.to_latlon(x_coord, y_coord, 35, 'V') # 35 and V are parameters in the UTM system. They specify an area near Finland.
#    print(latitude,longitude)
    row.X = latitude
    row.Y = longitude
    return row

    
location_data = location_data.apply(utm_coord_to_latlon, axis = 1)

print(location_data.dtypes)
#print(x_coord, y_coord)
location_data.head()


#print(utm.to_latlon(x_coord, y_coord, 35, 'V'))

#location_data = work_data[['postal_code', 'X', 'Y']]
#location_data.columns = work_data['postal_code', 'latitude', 'longitude']

# Helsinki
# 60.1699° N, 
# 24.9384° E

work_data.head()

X    float64
Y    float64
dtype: object
X    float64
Y    float64
dtype: object


| | postal_codes_full | X | Y | postal_code_area | adult_population | basic_ed | educated_total | higher_ed | professional_ed | bachelor_ed | master_ed | prof_sci_tech_action | student_population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Whole Country | 429300 | 7084490 | 390813.6924 | 4459828 | 1060335 | 3399493 | 306488 | 2051978 | 542023 | 499004 | 136999 | 400807 |
| 1 | 00100 Helsinki Keskusta - Etu-Toolo (Helsinki) | 384979 | 6672361 | 2.353278 | 16273 | 1659 | 14614 | 2616 | 3027 | 2983 | 5988 | 7659 | 1212 |
| 2 | 00120 Punavuori (Helsinki) | 385531 | 6671434 | 0.41401 | 6202 | 679 | 5523 | 1062 | 1216 | 1040 | 2205 | 1430 | 402 |
| 3 | 00130 Kaartinkaupunki (Helsinki) | 386244 | 6671474 | 0.42896 | 1319 | 131 | 1188 | 245 | 234 | 190 | 519 | 2466 | 111 |
| 4 | 00140 Kaivopuisto - Ullanlinna (Helsinki) | 386394 | 6670766 | 0.931841 | 6800 | 713 | 6087 | 1144 | 1296 | 1167 | 2480 | 312 | 479 |


In [7]:
def split_postal_codes(row):
    
    string = row.postal_codes_full
    
    if string == 'Whole Country':
        code = '000000'
        region = 'Finland'
        borough = 'Finland'
    
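    else:        
        code, names = string.split(' ', 1) # the postal code is everything before the first space
        region, borough = names.split('(', 1) # the borough is given in parentheses at the end
        region = region.strip() # drop the space left before the parenthesis
        borough = borough[:-1] # remove last parenthesis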
    
    row['code'] = code
    row['region'] = region
    row['borough'] = borough
    return row

names_df = work_data.apply(split_postal_codes, axis = 1)[['code', 'region', 'borough']]

names_df.head()

| | code | region | borough |
|---|---|---|---|
| 0 | 0 | Finland | Finland |
| 1 | 100 | Helsinki Keskusta - Etu-Toolo | Helsinki |
| 2 | 120 | Punavuori | Helsinki |
| 3 | 130 | Kaartinkaupunki | Helsinki |
| 4 | 140 | Kaivopuisto - Ullanlinna | Helsinki |
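The same split can also be written with a regular expression, which makes the expected shape of the label explicit and returns `None` for labels that do not match. This is an alternative sketch, not the notebook's method; `POSTAL_PATTERN` and `parse_postal_label` are hypothetical names introduced here.

```python
import re

# Expected shape: five-digit code, whitespace, region name, borough in parentheses.
POSTAL_PATTERN = re.compile(r"^(\d{5})\s+(.*?)\s*\(([^()]*)\)\s*$")

def parse_postal_label(label):
    """Return (code, region, borough), or None if the label has another shape."""
    match = POSTAL_PATTERN.match(label)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)

parsed = parse_postal_label("00120 Punavuori (Helsinki)")
print(parsed)
```

Keeping the code as a string preserves its leading zeros, which are lost if the column is ever read back as an integer.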


In [8]:
# Data normalization

clusterable_data = work_data.drop(['postal_codes_full', 'X', 'Y'], axis = 1)
clusterable_data = clusterable_data.astype(float)

# Divide every count by the land area to get densities; iloc[:, 1:] skips the area column itself
population_data = clusterable_data.iloc[:,1:].div(clusterable_data.postal_code_area, axis=0)

population_data.head()

# Standardization (z-score): subtract the column mean and divide by the column standard deviation
mean_normalized_cluster=(population_data-population_data.mean())/population_data.std()

# Min-max normalization: rescale each column to the range [0, 1]
minmax_normalized_cluster=(population_data-population_data.min())/(population_data.max()-population_data.min())

mean_normalized_cluster.head()







| | adult_population | basic_ed | educated_total | higher_ed | professional_ed | bachelor_ed | master_ed | prof_sci_tech_action | student_population |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.259503 | -0.281785 | -0.250087 | -0.193639 | -0.309552 | -0.224519 | -0.191046 | -0.100839 | -0.266827 |
| 1 | 7.075836 | 3.845625 | 7.65911 | 7.628005 | 4.249038 | 7.476642 | 10.683497 | 19.272661 | 5.593823 |
| 2 | 15.645474 | 9.341402 | 16.752938 | 17.862335 | 10.123549 | 15.045314 | 22.576387 | 20.459826 | 10.792745 |
| 3 | 2.995537 | 1.497142 | 3.271016 | 3.822351 | 1.612949 | 2.460989 | 4.976834 | 34.121195 | 2.671929 |
| 4 | 7.482093 | 4.199309 | 8.070003 | 8.445028 | 4.620917 | 7.383988 | 11.183181 | 1.890356 | 5.582534 |
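The two normalizations computed in this cell behave differently when a column has a few very large values, as the dense Helsinki areas do here. The following sketch uses a tiny invented frame (`densities` is hypothetical) to show the two formulas side by side.

```python
import pandas as pd

# Invented densities with one extreme row, to mimic a dense city-centre area.
densities = pd.DataFrame({
    "bachelor_ed": [10.0, 20.0, 30.0, 400.0],
    "master_ed": [5.0, 15.0, 25.0, 900.0],
})

# Z-score: each column ends up with mean 0 and (sample) standard deviation 1.
z_scores = (densities - densities.mean()) / densities.std()

# Min-max: each column is mapped linearly onto [0, 1].
min_max = (densities - densities.min()) / (densities.max() - densities.min())

print(z_scores.round(3))
print(min_max.round(3))
```

In the min-max version the extreme row is pinned to 1 and the remaining rows are squeezed towards 0, which may matter when the result is fed into a distance-based method such as k-means.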


In [9]:
# Clustering and visualization of the clusters.


# Number of clusters
kclusters = 5

grouped_clustering = mean_normalized_cluster

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(grouped_clustering)

The following cell contains the Foursquare CLIENT_ID and CLIENT_SECRET strings. It is a hidden cell, so it should not appear in the GitHub version.

In [10]:
# The code was removed by Watson Studio for sharing.

In [11]:
# Further FourSquare parameters
VERSION = '20180604'
LIMIT = 100
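These parameters would typically be combined into a request URL for a venue search. The sketch below only builds the URL and sends nothing. It assumes the legacy Foursquare v2 `venues/search` endpoint, which may no longer be available; the `radius` and `query` values and the helper name `build_search_url` are hypothetical choices, not part of the notebook.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, lat, lon,
                     version="20180604", limit=100, radius=1000, query="coffee"):
    """Build a venue search URL for the (legacy) Foursquare v2 API."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, as in VERSION above
        "ll": f"{lat},{lon}",    # latitude,longitude of the search centre
        "radius": radius,        # search radius in meters (assumed value)
        "limit": limit,          # maximum number of results, as in LIMIT above
        "query": query,          # free-text filter (assumed value)
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials; the real ones live in the hidden cell.
url = build_search_url("YOUR_ID", "YOUR_SECRET", 60.1699, 24.9384)
print(url)
```

Because each response is capped by the limit, counts for dense areas may be underestimates, which is worth keeping in mind when correlating them with the education profiles.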