# Homework 1
## Uzay Karadağ 090200738



### Question 1

Istanbul municipality has an open data service that provides detailed information about its services. For this question, use the data given at [this link](https://data.ibb.gov.tr/dataset/istanbul-sehir-hatlari-iskeleleri).

1. Understand what the data is for. Explain what it is, what it records, what pieces it has in broad strokes. 

2. Among other things, the data contains the geographical locations of the sea stations ('Iskele') served by Istanbul Deniz Isletmeleri boats operating in Istanbul. Extract the locations of these stations as a pandas dataframe with 2 columns: latitudes and longitudes. The index of the dataframe has to be the station names.

The data is in XML format. Do not use any external libraries other than numpy and pandas. Use `xmltodict` to convert it into a dictionary then extract the necessary parts.

#### Importing necessary libraries:
1. `urlopen` from `urllib.request` is needed to make an HTTP GET request to the IBB Municipality URL and obtain the .kml data.
2. `parse` from `xmltodict` is needed to parse the .kml file into a dictionary (depending on the `xmltodict` version, an `OrderedDict` from Python's `collections` module or a plain `dict`).
3. `pandas` is needed because we will be working with pandas DataFrame objects once the raw data is transformed.

In [1]:
from urllib.request import urlopen
from xmltodict import parse
import pandas as pd

#### Making an HTTP GET request to the IBB DB:
Using `urlopen`, we make an HTTP GET request to the IBB Municipality site to get the public dataset.

In [2]:
with urlopen("https://data.ibb.gov.tr/dataset/b47b5391-bcca-4bb3-a575-8ece68901d5d/resource/bcbeff5d-14d7-4ec0-a211-4c4d3e96cfba/download/istanbul-ehir-hatlar-iskeleleri.kml") as url:
    raw = parse(url, encoding='utf-8')

#### Evaluating the raw data:
This dataset holds information about all of the sea stations located in Istanbul, including their coordinates as latitude and longitude. As can be seen on IBB's website, the rest of the file contains map-related data, such as the icons shown on the map and the color used for each station.

* I first output the raw data to see what it looks like and get a grasp of its format. However, it was tiringly long when I pushed the notebook to GitHub, so I decided to save the readers' eyesight by not outputting it there.
* The same applies to the other transformations of this dataset, which were really lengthy and ugly. I checked every intermediate version in the code cells myself and then left the output out of the final submission.

#### Formatting the ordered dictionary:
As the raw output shows, the dataset is quite messy, and we only need the station name, longitude and latitude. Inspecting the structure shows that we have to go down the tree along 'kml'->'Document'->'Folder'->'Folder'.

In [3]:
data = raw['kml']['Document']['Folder']['Folder']

#### Looping through the data to record the desired parts:
Now we loop through the dictionary and record each station's name and coordinates in separate lists. Since the longitude and latitude can be found under either the 'LookAt' or the 'Camera' key, we need a control structure, in this case an if-elif-else statement.

In [4]:
station_name, longitude, latitude = [], [], []

for item in data:
    for station in item['Placemark']:
        if station.get('LookAt'):
            station_name.append(station['name'])
            longitude.append(station['LookAt']['longitude'])
            latitude.append(station['LookAt']['latitude'])
        elif station.get('Camera'):
            station_name.append(station['name'])
            longitude.append(station['Camera']['longitude'])
            latitude.append(station['Camera']['latitude'])
        else:
            print('ERROR: Cannot obtain coordinate data.')
            
coordinate_data = {'station_name': station_name, 'latitude': latitude, 'longitude': longitude}
coordinate_data = pd.DataFrame(coordinate_data).set_index('station_name')

coordinate_data

station_name,latitude,longitude
MALTEPE,40.91681013544846,29.13060758098593
AHIRKAPI,41.00314456999032,28.98289668101853
BEŞİKTAŞ-1,41.04116198628195,29.00778819900819
BEŞİKTAŞ-2,41.04065414312002,29.0055048939288
BOSTANCI,40.95173395654253,29.09425745312653
EMİNÖNÜ-1,41.01495987953694,28.97621869809887
EMİNÖNÜ-2,41.01495987953694,28.97621869809887
EMİNÖNÜ-3,41.01488637107048,28.97495985342729
EMİNÖNÜ-4,41.01488637107048,28.97495985342729
HAYDARPAŞA,40.99577360085738,29.01810215560077
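
The extraction loop above can also be sketched in a more compact form. The snippet below is only an illustration under stated assumptions: `sample` is a hypothetical, hand-written stand-in for the parsed KML structure (not the real IBB data), and the coordinates are cast to `float` because the values in the parsed document are assumed to arrive as strings.

```python
import pandas as pd

# Hypothetical sample mimicking the parsed 'Folder' -> 'Placemark' structure (not real data).
sample = [
    {"Placemark": [
        {"name": "A", "LookAt": {"longitude": "29.10", "latitude": "40.90"}},
        {"name": "B", "Camera": {"longitude": "28.98", "latitude": "41.00"}},
        {"name": "C"},  # no coordinates at all: skipped below
    ]},
]

rows = []
for folder in sample:
    for station in folder["Placemark"]:
        # Prefer 'LookAt' and fall back to 'Camera' when it is missing.
        view = station.get("LookAt") or station.get("Camera")
        if view is None:
            continue
        rows.append({
            "station_name": station["name"],
            "latitude": float(view["latitude"]),
            "longitude": float(view["longitude"]),
        })

coords = pd.DataFrame(rows).set_index("station_name")
print(coords)
```

Storing the coordinates as floats rather than strings makes later numeric work (distances, plotting) possible without another conversion step.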


### Question 2

For this question we are going to use Istanbul Municipality data at [this link](https://data.ibb.gov.tr/dataset/sehir-hatlari-sefer-sayilari). Data contains Istanbul Deniz Isletmeleri route information.

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Extract the data about the number of trips between stations.
3. Calculate the total number of trips in 2020.
4. Calculate the total number of trips in 2021.
5. Which is the busiest station in 2020 and 2021?

For these questions you must extract the specific data you need from the raw data. Let me be very clear: I am not interested in the numerical answer, I'd like to see your data extraction and calculation explicitly. I need to see your code with which you extract the data, see the data frame where you record the extracted data, and the code where you group and calculate the required results.

#### Making an HTTP GET request to get the data from the IBB URL:
Just like in Q1, we make GET requests to the IBB website, this time two of them: one for the 2020 data and one for the 2021 data.
Using ISO-8859-9 as the encoding wasn't a choice; it was necessary because the Turkish characters didn't render correctly otherwise.

In [5]:
with urlopen("https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/f1f95d5d-fa2f-479d-9d50-85ca1d604c1e/download/2020-yl-ehir-hatlar-sefer-saylar.csv") as url:
    trip_data_2020 = pd.read_csv(url, encoding='ISO-8859-9', sep=';')
trip_data_2020 = trip_data_2020.rename(columns={"YIL": "year", "GÜZERGAH": "route", "TOPLAM SEFER ADETİ": "trip_count"})
trip_data_2020

Unnamed: 0,year,route,trip_count
0,2020,BEŞİKTAŞ - KADIKÖY,26.879
1,2020,KADIKÖY - KARAKÖY - BEŞİKTAŞ,13.0
2,2020,EMİNÖNÜ - ÜSKÜDAR,28.441
3,2020,ÜSKÜDAR - KARAKÖY - EMİNÖNÜ,8.737
4,2020,KADIKÖY - EMİNÖNÜ,18.408
5,2020,KADIKÖY - KARAKÖY,25.658
6,2020,KABATAŞ - KADIKÖY - ADALAR - BOSTANCI,5.879
7,2020,İSTANBUL - ADALAR,4.542
8,2020,KADIKÖY - KARAKÖY - EMİNÖNÜ,11.156
9,2020,BOĞAZ GİDİŞ GELİŞ (EMİNÖNÜ - BEŞİKTAŞ -KUZGUN...,523.0
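
To illustrate why the encoding matters, here is a minimal sketch. The bytes below are a hypothetical two-line file built in memory (not the real download), encoded as ISO-8859-9 the way the portal's files are assumed to be. Reading them with the wrong Latin encoding does not raise an error; it silently produces different characters, so the column names no longer match.

```python
import io
import pandas as pd

# Hypothetical two-line CSV, encoded as ISO-8859-9 (Turkish Latin-5).
raw_bytes = (
    "YIL;GÜZERGAH;TOPLAM SEFER ADETİ\n"
    "2020;BEŞİKTAŞ - KADIKÖY;26.879\n"
).encode("iso-8859-9")

# Correct encoding: Turkish letters survive.
good = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-9", sep=";")
# Wrong (Western European) encoding: same bytes, different letters.
bad = pd.read_csv(io.BytesIO(raw_bytes), encoding="ISO-8859-1", sep=";")

print(list(good.columns))
print(list(bad.columns))
```

With the wrong encoding, a later `rename(columns={...})` keyed on the Turkish column names would quietly do nothing for the affected columns.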


In [6]:
with urlopen("https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/d2c7e4c3-fd09-4952-8a8e-776e3accf91d/download/2021-yl-ehir-hatlar-sefer-saylar.csv") as url:
    trip_data_2021 = pd.read_csv(url, encoding="ISO-8859-9", sep=";").dropna()
    trip_data_2021 = trip_data_2021.rename(columns={"Yil": "year", "Guzergah": "route", "Toplam Sefer Adeti": "trip_count"})
trip_data_2021

Unnamed: 0,year,route,trip_count
0,2021.0,BEŞİKTAŞ-KADIKÖY,23.658
1,2021.0,EMİNÖNÜ-ÜSKÜDAR,23.854
2,2021.0,EMİNÖNÜ-KADIKÖY,18.298
3,2021.0,EMİNÖNÜ-BEŞİKTAŞ-KUZGUNCUK-BEYLERBEYİ-ÇENGELKÖ...,497.0
4,2021.0,EMİNÖNÜ-BEŞİKTAŞ-ORTAKÖY-EMİRGAN-PAŞABAHÇE-BEY...,545.0
5,2021.0,ÇENGELKÖY-BEŞİKTAŞ-EMİNÖNÜ,433.0
6,2021.0,KADIKÖY-KARAKÖY,6.168
7,2021.0,KADIKÖY-KARAKÖY-EMİNÖNÜ,18.304
8,2021.0,KABATAŞ-KADIKÖY-ADALAR,7.046
9,2021.0,BOSTANCI- BÜYÜKADA-HEYBELİADA,940.0


#### Evaluating the data:
The data is easy to interpret since it only has three columns, one of which is constant within each DataFrame.
1. *year*: this one is pretty self-explanatory.
2. *route*: aside from minor differences in formatting, this column contains the sequence of station names along each route.
3. *trip_count*: this is where the dataset really becomes important. This column contains the number of trips made on the respective route that year.

#### Calculating the number of total trips:
Next, we calculate the total number of trips made each year. The challenge is that Turkish number formatting uses "." as a thousands separator for readability and "," for decimals, whereas Python, following the US convention, reads the dot as a decimal point. Since our "trip_count" column only ever contains one dot, for numbers of at most 6 digits, I can get away with multiplying every element by 1e3 (1000) to offset the notation.

In [7]:
if trip_data_2020['trip_count'][0] != 26879 and trip_data_2021['trip_count'][0] != 23658:
    trip_data_2020['trip_count'] = trip_data_2020['trip_count'] * 1000
    trip_data_2021['trip_count'] = trip_data_2021['trip_count'] * 1000
    
total_trips_2020 = trip_data_2020['trip_count'].sum()
total_trips_2021 = trip_data_2021['trip_count'].sum()

print("Total trips made in 2020: %d"%total_trips_2020)
print("Total trips made in 2021: %d"%total_trips_2021)

Total trips made in 2020: 5851006
Total trips made in 2021: 8956095
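
An alternative to rescaling, sketched here under assumptions, is to tell pandas about the separators at read time through the `thousands` and `decimal` parameters of `read_csv`. The CSV text below is hypothetical. The difference matters for values that carry no dot in the file: a value displayed as `523.0` after parsing could stand for 523 trips, and multiplying it by 1000 turns it into 523000, whereas parsing with the separator declared leaves it at 523.

```python
import io
import pandas as pd

# Hypothetical rows: two values with a thousands dot, one plain three-digit value.
csv_text = (
    "route;trip_count\n"
    "A - B;26.879\n"
    "C - D;13.000\n"
    "E - F;523\n"
)

# Declare '.' as the thousands separator and ',' as the decimal mark.
trips = pd.read_csv(io.StringIO(csv_text), sep=";", thousands=".", decimal=",")
print(trips["trip_count"].tolist())
print(trips["trip_count"].sum())
```

Whether the real files contain such undotted values has to be checked against the raw CSV; this sketch only shows how the two approaches would differ if they do.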


#### Finding the busiest station:
Finding the busiest station takes a bit more programming. Our routes are strings of station names separated by dashes and spaces. We first have to split every route into its individual stations and then add the route's trip count to the trip count of each of those stations. After that, it is easy to find the busiest station, i.e. the one with the most trips. Keep in mind that we perform these operations for both years.

In [8]:
pd.options.mode.chained_assignment = None
if not isinstance(trip_data_2020['route'].iloc[0], list):
    # Strip the spaces, then split each route string into a list of station names.
    trip_data_2020['route'] = trip_data_2020['route'].apply(lambda route: route.replace(" ", "").split("-"))
trip_data_2020

Unnamed: 0,year,route,trip_count
0,2020,"[BEŞİKTAŞ, KADIKÖY]",26879.0
1,2020,"[KADIKÖY, KARAKÖY, BEŞİKTAŞ]",13000.0
2,2020,"[EMİNÖNÜ, ÜSKÜDAR]",28441.0
3,2020,"[ÜSKÜDAR, KARAKÖY, EMİNÖNÜ]",8737.0
4,2020,"[KADIKÖY, EMİNÖNÜ]",18408.0
5,2020,"[KADIKÖY, KARAKÖY]",25658.0
6,2020,"[KABATAŞ, KADIKÖY, ADALAR, BOSTANCI]",5879.0
7,2020,"[İSTANBUL, ADALAR]",4542.0
8,2020,"[KADIKÖY, KARAKÖY, EMİNÖNÜ]",11156.0
9,2020,"[BOĞAZGİDİŞGELİŞ(EMİNÖNÜ, BEŞİKTAŞ, KUZGUNCUK,...",523000.0


I transformed the routes from strings of stations separated by dashes into lists of stations. This was done so we could record the total number of trips made to or from any given station with a simple membership test (`in`).

In [20]:
stations_2020 = []
for station_list in trip_data_2020['route']:
    stations_2020.extend(station_list)
stations_2020

['BEŞİKTAŞ',
 'KADIKÖY',
 'KADIKÖY',
 'KARAKÖY',
 'BEŞİKTAŞ',
 'EMİNÖNÜ',
 'ÜSKÜDAR',
 'ÜSKÜDAR',
 'KARAKÖY',
 'EMİNÖNÜ',
 'KADIKÖY',
 'EMİNÖNÜ',
 'KADIKÖY',
 'KARAKÖY',
 'KABATAŞ',
 'KADIKÖY',
 'ADALAR',
 'BOSTANCI',
 'İSTANBUL',
 'ADALAR',
 'KADIKÖY',
 'KARAKÖY',
 'EMİNÖNÜ',
 'BOĞAZGİDİŞGELİŞ(EMİNÖNÜ',
 'BEŞİKTAŞ',
 'KUZGUNCUK',
 'BEYLERBEYİ',
 'ÇENGELKÖY',
 'ARNAVUTKÖY)',
 'BOĞAZGİDİŞGELİŞ(EMİNÖNÜ',
 'BEŞİKTAŞ',
 'ORTAKÖY',
 'EMİRGAN',
 'PAŞABAHÇE',
 'BEYKOZ)',
 'ÇENGELKÖY',
 'BEŞİKTAŞ',
 'EMİNÖNÜ',
 'ÜSKÜDAR',
 'KARAKÖY',
 'KASIMPAŞA',
 'FENER',
 'BALAT',
 'HASKÖY',
 'AYVANSARAY',
 'SÜTLÜCE',
 'EYÜP',
 'KADIKÖY',
 'EMİRGAN(LALESEFERLERİ)',
 'BOSTANCI',
 'BÜYÜKADA',
 'HEYBELİADA',
 'HEYBELİADA',
 'BÜYÜKADA',
 'BOSTANCI',
 'ADALAR',
 'BÜYÜKADA',
 'BOSTANCI',
 'BÜYÜKADA',
 'KINALIADA(RİNG)',
 'BEYKOZ',
 'MUHTELİFİSK.',
 'BEŞİKTAŞ',
 'EMİNÖNÜ',
 'BEYKOZ',
 'MUHTELİFİSK.',
 'EMİNÖNÜ',
 'A.HİSARI',
 'MUHTELİFİSK.',
 'EMİNÖNÜ',
 'ÇENGELKÖY',
 'BEYLERBEYİ',
 'BEŞİKTAŞ',
 'EMİNÖNÜ',
 'A.KAV

Now that we have all of the stations in a list, we can create another DataFrame with two columns, *station_name* and *trip_count*, and increment the counts according to our dataset. I am aware that the *iterrows* method is looked down upon in the data science community for being inefficient, but it was the best solution I could come up with at the moment.

In [21]:
temp_dict = {"station_name": stations_2020}
station_data_2020 = pd.DataFrame(temp_dict)
station_data_2020["trip_count"] = 0
station_data_2020 = station_data_2020.drop_duplicates()

for i, station in station_data_2020.iterrows():
    for ii, item in trip_data_2020.iterrows():
        if station.station_name in item.route:
            # .loc avoids chained assignment; int() keeps the column integer-typed.
            station_data_2020.loc[i, 'trip_count'] += int(item.trip_count)
station_data_2020

Unnamed: 0,station_name,trip_count
0,BEŞİKTAŞ,3214590
1,KADIKÖY,659980
3,KARAKÖY,71881
5,EMİNÖNÜ,1859768
6,ÜSKÜDAR,604727
14,KABATAŞ,10351
16,ADALAR,17444
17,BOSTANCI,14018
18,İSTANBUL,4542
23,BOĞAZGİDİŞGELİŞ(EMİNÖNÜ,1136000


In [11]:
station_data_2020[station_data_2020.trip_count == max(station_data_2020.trip_count)]

Unnamed: 0,station_name,trip_count
0,BEŞİKTAŞ,3214590
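
The nested loops above can be sketched in vectorized form with `explode` and `groupby`. This is a minimal illustration, not the method used in this notebook: the `trips` frame is a hypothetical miniature with made-up numbers. Duplicates are dropped per route so that a station is credited once per route, matching the membership test used in the loop.

```python
import pandas as pd

# Hypothetical miniature of the trip table (numbers are made up).
trips = pd.DataFrame({
    "route": ["A - B", "B - C - A", "C - B"],
    "trip_count": [100, 50, 30],
})

# One row per (route, station); drop repeats so a station counts once per route.
stations = (
    trips.assign(
        station_name=trips["route"].str.replace(" ", "", regex=False).str.split("-")
    )
    .explode("station_name")
    .reset_index()  # keeps the original row number in an 'index' column
    .drop_duplicates(["index", "station_name"])
)

totals = stations.groupby("station_name")["trip_count"].sum()
print(totals)
print(totals.idxmax())
```

Here `totals.idxmax()` returns the label of the station with the largest summed trip count, which replaces the boolean filter on `max(...)`.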


Judging from the results, **Beşiktaş** was the busiest station of 2020. Let's quickly do the same calculations for the 2021 dataset.

In [12]:
if not isinstance(trip_data_2021['route'].iloc[0], list):
    # Split each route string into a list of station names.
    trip_data_2021['route'] = trip_data_2021['route'].apply(lambda route: route.split("-"))

stations_2021 = []
for station_list in trip_data_2021['route']:
    stations_2021.extend(station_list)
stations_2021

temp_dict = {"station_name": stations_2021}
station_data_2021 = pd.DataFrame(temp_dict)
station_data_2021["trip_count"] = 0
station_data_2021 = station_data_2021.drop_duplicates()

for i, station in station_data_2021.iterrows():
    for ii, item in trip_data_2021.iterrows():
        if station.station_name in item.route:
            # .loc avoids chained assignment; int() keeps the column integer-typed.
            station_data_2021.loc[i, 'trip_count'] += int(item.trip_count)
station_data_2021

station_data_2021[station_data_2021.trip_count == max(station_data_2021.trip_count)]

Unnamed: 0,station_name,trip_count
0,BEŞİKTAŞ,4814193


What a surprise: our busyness champion of 2020 goes for the repeat in 2021. With a total trip count of 4,814,193, **Beşiktaş** station comes out on top in both years.

### Question 3

Using a different [dataset](https://data.ibb.gov.tr/dataset/istanbul-deniz-iskeleleri-yolcu-sayilari) again from Istanbul Municipality on Istanbul Deniz Isletmeleri: 

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Find out the busiest station in the years 2020 and 2021,
3. Repeat the same calculation monthly: find the busiest stations for each month.
4. Does your calculation of busiest stations agree with the calculation you made in Question #2? Explain.


In [13]:
with urlopen("https://data.ibb.gov.tr/dataset/20f33ff0-1ab3-4378-9998-486e28242f48/resource/6fbdd928-8c37-43a4-8e6a-ba0fa7f767fb/download/istanbul-deniz-iskeleleri-yolcu-saylar.csv") as url:
    passenger_data = pd.read_csv(url, encoding='ISO-8859-9', sep=';')
passenger_data = passenger_data.drop(columns="Otorite Adi").rename(columns={"Yil": "year", "Ay": "month", "Istasyon Adi": "station_name", "Yolcu Sayisi": "passenger_count"})
passenger_data

Unnamed: 0,year,month,station_name,passenger_count
0,2021,3,BEYKOZ,5076
1,2021,3,YENIKOY,5347
2,2021,3,BESIKTAS,106334
3,2021,3,KABATAS,24
4,2021,3,USKUDAR,94200
...,...,...,...,...
656,2021,11,Eminönü,55387
657,2021,11,Kadıköy Balon,40680
658,2021,11,Kadıköy Çayırbaşı,69443
659,2021,11,Karaköy,55098


#### Evaluating the data:
This dataset seems similar to the previous one, with a caveat: whereas the last dataset recorded the number of trips made to or from stations, this one records the number of passengers passing through each station. There was also an *authority* column ("Otorite Adi"), but I decided to drop it from the DataFrame since I can't see how it would be relevant in this case.

In [14]:
total_passenger_count_2020 = passenger_data[passenger_data.year == 2020].passenger_count.sum()
total_passenger_count_2021 = passenger_data[passenger_data.year == 2021].passenger_count.sum()
print("Total passenger count in 2020: %d"%total_passenger_count_2020)
print("Total passenger count in 2021: %d"%total_passenger_count_2021)

Total passenger count in 2020: 0
Total passenger count in 2021: 33162030


After calculating the total passenger count for both years, it is apparent that this dataset has no data for the year 2020. I checked the .csv file itself and was unable to find any data points for 2020 in it.
Because of this, I will only perform calculations for 2021 from here on.
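
A quick way to confirm which years a frame actually covers is to count rows per year instead of opening the file by hand. The sketch below uses a hypothetical three-row stand-in for `passenger_data` with made-up numbers.

```python
import pandas as pd

# Hypothetical stand-in for the passenger table (numbers are made up).
passengers = pd.DataFrame({
    "year": [2021, 2021, 2021],
    "month": [3, 3, 4],
    "station_name": ["X", "Y", "X"],
    "passenger_count": [10, 20, 30],
})

# Number of rows recorded for each year that is present.
print(passengers["year"].value_counts())

# Totals per year; years with no rows are filled in explicitly as 0.
per_year = (
    passengers.groupby("year")["passenger_count"].sum()
    .reindex([2020, 2021], fill_value=0)
)
print(per_year)
```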

In [15]:
station_data = passenger_data.groupby(['station_name']).sum().drop(columns={'year', 'month'})
station_data[station_data.passenger_count == station_data.passenger_count.max()]

station_name,passenger_count
USKUDAR,6083839
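
One caveat worth checking: the `passenger_data` preview shows station names in mixed spellings (for example `BESIKTAS` in upper-case ASCII and `Karaköy` in mixed case with Turkish letters). If the same station appears under more than one spelling, `groupby` treats the spellings as separate stations. The sketch below shows one possible normalization; `normalize_name` is a hypothetical helper, the frame is made up, and whether the real data needs this has to be verified against the file.

```python
import unicodedata
import pandas as pd

def normalize_name(name: str) -> str:
    """Upper-case a name and strip its diacritics (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", name.strip().upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical rows where one station is spelled two ways (numbers are made up).
passengers = pd.DataFrame({
    "station_name": ["USKUDAR", "Üsküdar", "KARAKOY", "Karaköy"],
    "passenger_count": [100, 40, 70, 20],
})

passengers["station_key"] = passengers["station_name"].map(normalize_name)
totals = passengers.groupby("station_key")["passenger_count"].sum()
print(totals)
```

Grouping on the normalized key merges the spellings, so each station's total includes all of its rows.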


It seems that, even though **Beşiktaş** took the cake in number of trips made, **Üsküdar** was the busiest station of 2021 in terms of Istanbulite traffic.

In [16]:
for i in range(3, 12):
    temp = passenger_data.loc[passenger_data.month == i]
    temp = temp.groupby(['station_name']).sum()
    print(temp.index[temp.passenger_count == temp.passenger_count.max()], "\n\n")

Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 


Index(['USKUDAR'], dtype='object', name='station_name') 
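
The monthly loop above can also be written as a single grouped computation. This is a sketch on a hypothetical frame with made-up numbers: passenger counts are summed per month and station, and `idxmax` then picks the station with the largest total within each month.

```python
import pandas as pd

# Hypothetical monthly passenger counts (numbers are made up).
passengers = pd.DataFrame({
    "month": [3, 3, 4, 4],
    "station_name": ["X", "Y", "X", "Y"],
    "passenger_count": [10, 25, 40, 5],
})

monthly = passengers.groupby(["month", "station_name"])["passenger_count"].sum()

# Within each month, idxmax returns the (month, station_name) label of the
# largest total; keep only the station part of that label.
busiest = monthly.groupby(level="month").idxmax().map(lambda label: label[1])
print(busiest)
```

The result is one Series indexed by month, which is easier to read than nine separate `Index(...)` printouts.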




**Üsküdar** once again seems to be the busiest station in terms of human traffic on a monthly basis, from the 3rd through the 11th month of 2021, according to the dataset.

##### Cross evaluation of the results from Q2 and Q3:
Even though **Beşiktaş Sea Station** had the most trips in both 2020 and 2021, **Üsküdar Sea Station** handled the most passengers throughout 2021. This suggests that, even though fewer trips were made to or from **Üsküdar**, the vessels serving it were more crowded than those serving **Beşiktaş**.