# Some ground rules for the assignments:

For all assignments (this one and any future assignment including the final project): 

* Do not download and save the data locally unless your data is very big (~TBs). I **do not** want to see you opening a local file for the data I gave you as a URL. Anything local is suspect: local files can't be trusted (they might be manipulated, changed, modified, or tampered with). Refer to my lecture notes on how to pull data from a URL using `urlopen`.

* All computations must be done locally within Python. Nothing external: no manual input, no Excel, no SQL, no Java, etc.

* All code has to be explained. Explain your reasoning and your choices. If you installed a third-party library (including `numpy`, `scipy`, `pandas`, etc.), explain which part you import and what that function does.

* Explain your code using a markdown cell. **Do not** use code comments starting with `#` to do your explanations.

* Do not use `if __name__ == "__main__"`. EVER! If you are using that within jupyter, I am going to assume you found the solution on the internet and you cut and pasted it without understanding what that piece of code did.

# Question 1

Istanbul municipality has an open data service, and it provides detailed information about their services. For this question, use the data given at [this link](https://data.ibb.gov.tr/dataset/istanbul-sehir-hatlari-iskeleleri).

1. Understand what the data is for. Explain what it is, what it records, what pieces it has in broad strokes. 

2. The data (among other things) contains the geographical locations of the sea stations ('Iskele') of Istanbul Deniz Isletmeleri boats operating in Istanbul. Extract the locations of these stations as a pandas dataframe with 2 columns: latitudes and longitudes. The index of the dataframe has to be the station names.

The data is in XML format. Do not use any external libraries other than numpy and pandas. Use `xmltodict` to convert it into a dictionary then extract the necessary parts.

Below, I am installing certain libraries and importing commands I need. 
* **urlopen** to obtain my data from its website and assign it.
* **parse** function that I have imported from xmltodict helps me convert my XML to a dictionary.
* I have imported **pandas** in order to create a pandas data frame later on in my code 

In [25]:
from urllib.request import urlopen
from xmltodict import parse
import pandas

# Q1.1 
## Understanding the data
**This data set contains information about docks that are distributed all over Istanbul. For each dock it records the name and the longitude, latitude, altitude, heading, tilt, range and coordinates.**

# Q1.2 
## Extracting data
Here I load the data from its URL and parse it into a dictionary so I can work on it. I pass encoding='utf-8' so that the data is read properly.

In [26]:
with urlopen("https://data.ibb.gov.tr/dataset/b47b5391-bcca-4bb3-a575-8ece68901d5d/resource/bcbeff5d-14d7-4ec0-a211-4c4d3e96cfba/download/istanbul-ehir-hatlar-iskeleleri.kml") as url:
    data = parse(url,encoding='utf-8')
data

OrderedDict([('kml',
              OrderedDict([('@xmlns', 'http://www.opengis.net/kml/2.2'),
                           ('@xmlns:gx', 'http://www.google.com/kml/ext/2.2'),
                           ('@xmlns:kml', 'http://www.opengis.net/kml/2.2'),
                           ('@xmlns:atom', 'http://www.w3.org/2005/Atom'),
                           ('Document',
                            OrderedDict([('name', 'SHI İSKELELER.kml'),
                                         ('StyleMap',
                                          [OrderedDict([('@id',
                                                         'msn_marina23'),
                                                        ('Pair',
                                                         [OrderedDict([('key',
                                                                        'normal'),
                                                                       ('styleUrl',
                                                            

By calling data.keys() I found that this data tree has a single root, kml.
I then keep checking the keys, going down the tree level by level, until I reach a list, which is data['kml']['Document']['Folder']['Folder'].

In [27]:
data.keys() 

odict_keys(['kml'])

In [28]:
data['kml']
data['kml'].keys()

odict_keys(['@xmlns', '@xmlns:gx', '@xmlns:kml', '@xmlns:atom', 'Document'])

In [29]:
data['kml']['Document']
data['kml']['Document'].keys()

odict_keys(['name', 'StyleMap', 'Style', 'Folder'])

In [30]:
data['kml']['Document']['Folder'].keys()

odict_keys(['name', 'Folder'])

In [31]:
type(data['kml']['Document']['Folder']['Folder'])

list
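
The key-by-key descent above can be condensed into a small helper. This is only a sketch: `walk` and the `toy` dictionary are hypothetical stand-ins (a heavily shortened imitation of the parsed KML), not part of the original notebook.

```python
def walk(tree, path):
    """Print the keys seen at each level along `path` and return the final node."""
    node = tree
    for key in path:
        print(list(node.keys()))
        node = node[key]
    return node


# Hypothetical, shortened imitation of the dictionary produced by xmltodict.
toy = {
    'kml': {
        '@xmlns': 'http://www.opengis.net/kml/2.2',
        'Document': {
            'name': 'demo',
            'Folder': {
                'name': 'root',
                'Folder': [{'name': 'A'}, {'name': 'B'}],
            },
        },
    }
}

folders = walk(toy, ['kml', 'Document', 'Folder', 'Folder'])
print(type(folders).__name__, len(folders))
```

On the real data the same call would be `walk(data, ['kml', 'Document', 'Folder', 'Folder'])`, assuming the structure shown in the cells above.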

By reaching this list I got closer to the values I am looking for which are longitude and latitude.

In [32]:
mylist = data['kml']['Document']['Folder']['Folder']
range(len(mylist))
mylist;

With this for loop I check whether all the elements (0 to 4) share a common key. It turns out that all of them have 'Placemark'.

In [33]:
for i in range(len(mylist)):
    print(mylist[i].keys())

odict_keys(['name', 'open', 'Placemark'])
odict_keys(['name', 'Placemark'])
odict_keys(['name', 'Placemark'])
odict_keys(['name', 'LookAt', 'Style', 'Placemark'])
odict_keys(['name', 'Placemark'])


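Once again I examine the keys, this time of the mylist[i]['Placemark'][n] elements, in a for loop. The loop below stops with an IndexError because it reuses the length of the first folder's 'Placemark' list for every folder, and a later folder has fewer elements; the keys printed before the error are enough to see the pattern. Not all elements have 'LookAt', so I pick one element that does (mylist[0]['Placemark'][0]) and examine its keys. There I finally reach the longitude and latitude values, so I can start collecting them from each element.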

In [34]:
pm = mylist[0]['Placemark']
for i in range(len(mylist)):
    for n in range(len(pm)):
        print(mylist[i]['Placemark'][n].keys())

odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'Camera', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys(['name', 'LookAt', 'styleUrl', 'Point'])
odict_keys([

IndexError: list index out of range
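
The IndexError above arises because the inner loop always runs over `len(pm)`, the length of the first folder's list, even for folders with fewer placemarks. A minimal sketch of a loop that avoids this is given here, using a made-up `toy_folders` list in place of `mylist`. It also wraps a lone placemark in a list, since xmltodict returns a single dictionary rather than a list when an element has only one child of a given name.

```python
# Hypothetical stand-in for mylist: two folders with different numbers of placemarks.
toy_folders = [
    {'name': 'F1', 'Placemark': [
        {'name': 'a', 'LookAt': {}, 'styleUrl': '', 'Point': {}},
        {'name': 'b', 'Camera': {}, 'styleUrl': '', 'Point': {}},
    ]},
    # A folder with a single placemark: a dict, not a list.
    {'name': 'F2', 'Placemark': {'name': 'c', 'LookAt': {}, 'styleUrl': '', 'Point': {}}},
]

seen = []
for folder in toy_folders:
    placemarks = folder['Placemark']
    if not isinstance(placemarks, list):
        placemarks = [placemarks]
    # Iterate over this folder's own placemarks, whatever their number.
    for placemark in placemarks:
        seen.append((folder['name'], placemark['name'], list(placemark.keys())))

for row in seen:
    print(row)
```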

In [35]:
mylist[0]['Placemark'][0]['LookAt'].keys()

odict_keys(['gx:TimeStamp', 'gx:ViewerOptions', 'longitude', 'latitude', 'altitude', 'heading', 'tilt', 'range', 'gx:altitudeMode'])

In order to collect the longitude and latitude values from each element I wrote two nested for loops. They walk through the data, collect the name, longitude and latitude of each element, and store them in a dictionary y. If the element has no 'LookAt' (and therefore no longitude and latitude under that key), I set both values to zero. Each y is then appended to a list x, so x fills up with the data I want to collect.

x is now a list that contains the name, longitude and latitude of every station.

In [36]:
x = []
for folder in mylist:
    for placemark in folder['Placemark']:
        try:
            lg = placemark['LookAt']['longitude']
            lt = placemark['LookAt']['latitude']
            y = {'name': placemark['name'], 'longitude': lg, 'latitude': lt}
        except KeyError:
            y = {'name': placemark['name'], 'longitude': 0, 'latitude': 0}
        x.append(y)

x

[{'name': 'MALTEPE',
  'longitude': '29.13060758098593',
  'latitude': '40.91681013544846'},
 {'name': 'AHIRKAPI', 'longitude': 0, 'latitude': 0},
 {'name': 'BEŞİKTAŞ-1',
  'longitude': '29.00778819900819',
  'latitude': '41.04116198628195'},
 {'name': 'BEŞİKTAŞ-2',
  'longitude': '29.0055048939288',
  'latitude': '41.04065414312002'},
 {'name': 'BOSTANCI',
  'longitude': '29.09425745312653',
  'latitude': '40.95173395654253'},
 {'name': 'EMİNÖNÜ-1', 'longitude': 0, 'latitude': 0},
 {'name': 'EMİNÖNÜ-2', 'longitude': 0, 'latitude': 0},
 {'name': 'EMİNÖNÜ-3', 'longitude': 0, 'latitude': 0},
 {'name': 'EMİNÖNÜ-4', 'longitude': 0, 'latitude': 0},
 {'name': 'HAYDARPAŞA', 'longitude': 0, 'latitude': 0},
 {'name': 'KABATAŞ', 'longitude': 0, 'latitude': 0},
 {'name': 'KADIKÖY-1', 'longitude': 0, 'latitude': 0},
 {'name': 'KADIKÖY-2', 'longitude': 0, 'latitude': 0},
 {'name': 'KARAKÖY', 'longitude': 0, 'latitude': 0},
 {'name': 'KARAKÖY-2',
  'longitude': '28.97517820592657',
  'latitude': '41
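
Many stations end up with zeros because they have a 'Camera' instead of a 'LookAt'. Every placemark printed earlier also has a 'Point' key, and in KML a Point holds a `coordinates` string of the form `longitude,latitude[,altitude]`. The sketch below falls back to that string. It assumes xmltodict exposes it as `placemark['Point']['coordinates']`, which the cells above do not verify. `lon_lat` is a hypothetical helper, and the AHIRKAPI coordinates are invented for the demonstration.

```python
def lon_lat(placemark):
    """Return (longitude, latitude) as floats, preferring LookAt, else Point."""
    look = placemark.get('LookAt')
    if look is not None:
        return float(look['longitude']), float(look['latitude'])
    # KML stores Point coordinates as "lon,lat[,alt]".
    lon, lat, *_ = placemark['Point']['coordinates'].strip().split(',')
    return float(lon), float(lat)


maltepe = {
    'name': 'MALTEPE',
    'LookAt': {'longitude': '29.13060758098593', 'latitude': '40.91681013544846'},
    'Point': {'coordinates': '29.13060758098593,40.91681013544846,0'},
}
# Invented coordinates, used only to exercise the fallback branch.
ahirkapi = {'name': 'AHIRKAPI', 'Camera': {}, 'Point': {'coordinates': '28.98,41.0,0'}}

print(lon_lat(maltepe))
print(lon_lat(ahirkapi))
```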

Now that I have the data I want in a list x, I convert it to a pandas data frame and set the index to the station names as the question asks. The stations that have no longitude and latitude values were assigned 0 in the previous step.

In [37]:
result = pandas.DataFrame(x)
result = result.set_index('name')
result

name,longitude,latitude
MALTEPE,29.13060758098593,40.91681013544846
AHIRKAPI,0.0,0.0
BEŞİKTAŞ-1,29.00778819900819,41.04116198628195
BEŞİKTAŞ-2,29.0055048939288,41.04065414312002
BOSTANCI,29.09425745312653,40.95173395654253
EMİNÖNÜ-1,0.0,0.0
EMİNÖNÜ-2,0.0,0.0
EMİNÖNÜ-3,0.0,0.0
EMİNÖNÜ-4,0.0,0.0
HAYDARPAŞA,0.0,0.0
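
The collected values are strings, and the missing ones are stored as 0, which is a valid coordinate far from Istanbul. A possible variant is sketched below: it converts the columns to numbers, marks missing values as NaN, and orders the columns as latitude then longitude, as the question asks. The `records` list is a two-row stand-in for x, with `None` used where x has 0.

```python
import pandas as pd

# Two-row stand-in for the list x; None marks a station without coordinates.
records = [
    {'name': 'MALTEPE', 'longitude': '29.13060758098593', 'latitude': '40.91681013544846'},
    {'name': 'AHIRKAPI', 'longitude': None, 'latitude': None},
]

stations = (
    pd.DataFrame(records)
    .set_index('name')[['latitude', 'longitude']]
    .apply(pd.to_numeric)  # strings become floats, None becomes NaN
)
print(stations)
```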


# Question 2

For this question we are going to use Istanbul Municipality data at [this link](https://data.ibb.gov.tr/dataset/sehir-hatlari-sefer-sayilari). Data contains Istanbul Deniz Isletmeleri route information.

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Extract the data about the number of trips between stations.
3. Calculate the total number of trips in 2020.
4. Calculate the total number of trips in 2021.
5. Which is the busiest station in 2020 and 2021?

For these questions you must extract the specific data you need from the raw data. Let me be very clear: I am not interested in the numerical answer, I'd like to see your data extraction and calculation explicitly. I need to see your code with which you extract the data, see the data frame where you record the extracted data, and the code where you group and calculate the required results.

The data I want to work on is in a csv file so I begin by importing pandas as pd and csv. This helps me read my data with pd.read_csv

In [38]:
import pandas as pd
import csv

# Q2.1 
## Understanding the data 
**This data set contains the routes in Istanbul Metropolitan Municipality City Lines and the number of trips made through them in 2020 and 2021**

# Q2.2
## Extracting the data
I start by adding both of my data sets to my directory, for 2020 and for 2021 by using pd.read_csv.
* 2020 Data

While reading the data for 2020 I noticed a typo in a column name, so I fixed it using **.rename**. The data set also uses "." (dot) as the thousands separator in the number-of-trips column, so I specified the thousands and decimal separators to make sure the numbers are read correctly. To extract the routes and the total number of trips, I used **.loc** and picked all rows but only the 2 columns I want. The final result is the extracted version of the data for 2020.

* 2021 Data

Almost the whole process was the same as for the 2020 data, except for one important part: the separators. The 2020 data is separated by ",", which is the expected default. When viewing the 2021 data in WordPad, however, I noticed that the fields are separated by ";". So I passed this information to the reader with **sep=";"**. Then I renamed the columns so that they match the 2020 data, and finally extracted the wanted data: the route and the number of trips.
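
The effect of `sep`, `thousands` and `decimal` can be checked on a tiny in-memory sample before touching the real file. The text in `raw` below is made up to imitate the 2021 layout. `pd.read_csv` also accepts a URL string in place of the buffer, which would avoid reading a local copy.

```python
import io

import pandas as pd

# Made-up sample imitating the 2021 layout: ';' between fields, '.' for thousands.
raw = "Guzergah;Yil;Toplam Sefer Adeti\nA-B;2021;23.658\nA-C;2021;497\n"

sample = pd.read_csv(io.StringIO(raw), sep=';', thousands='.', decimal=',')
print(sample)
print(sample['Toplam Sefer Adeti'].sum())
```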


In [39]:
data2020 = pd.read_csv(r'C:\Users\sinem\Downloads\2020 IMM Sehir Hatlari Number of Expeditions.csv',thousands='.', decimal=',')
data2020.rename(columns={'TOPLAM SEFER ADETY': 'TOPLAM SEFER ADETI'}, inplace=True)
data20extracted = data2020.loc[:,["GUZERGAH", "TOPLAM SEFER ADETI"]]
data20extracted 

,GUZERGAH,TOPLAM SEFER ADETI
0,BEÞÝKTAÞ - KADIKÖY,26879
1,KADIKÖY - KARAKÖY - BEÞÝKTAÞ,13
2,EMÝNÖNÜ - ÜSKÜDAR,28441
3,ÜSKÜDAR - KARAKÖY - EMÝNÖNÜ,8737
4,KADIKÖY - EMÝNÖNÜ,18408
5,KADIKÖY - KARAKÖY,25658
6,KABATAÞ - KADIKÖY - ADALAR - BOSTANCI,5879
7,ÝSTANBUL - ADALAR,4542
8,KADIKÖY - KARAKÖY - EMÝNÖNÜ,11156
9,BOÐAZ GÝDÝÞ GELÝÞ (EMÝNÖNÜ - BEÞÝKTAÞ -KUZGUN...,523


In [40]:
data2021 = pd.read_csv(r'C:\Users\sinem\Downloads\2021-yl-ehir-hatlar-sefer-saylar.csv',sep=';',thousands='.', decimal=',')
data2021.rename(columns={'Guzergah': 'GUZERGAH', 'Yil':'YIL','Toplam Sefer Adeti':'TOPLAM SEFER ADETI'}, inplace=True)
data21extracted = data2021.loc[:,['GUZERGAH','TOPLAM SEFER ADETI']]
data21extracted

,GUZERGAH,TOPLAM SEFER ADETI
0,BEŞİKTAŞ-KADIKÖY,23658
1,EMİNÖNÜ-ÜSKÜDAR,23854
2,EMİNÖNÜ-KADIKÖY,18298
3,EMİNÖNÜ-BEŞİKTAŞ-KUZGUNCUK-BEYLERBEYİ-ÇENGELKÖ...,497
4,EMİNÖNÜ-BEŞİKTAŞ-ORTAKÖY-EMİRGAN-PAŞABAHÇE-BEY...,545
5,ÇENGELKÖY-BEŞİKTAŞ-EMİNÖNÜ,433
6,KADIKÖY-KARAKÖY,6168
7,KADIKÖY-KARAKÖY-EMİNÖNÜ,18304
8,KABATAŞ-KADIKÖY-ADALAR,7046
9,BOSTANCI- BÜYÜKADA-HEYBELİADA,940


# Q2.3 - Q2.4
## Calculating the number of trips
When calculating the total number of trips made in a given year, I used **.sum()** on the 'TOPLAM SEFER ADETI' column of the extracted data. Then I printed it together with the year.

In [41]:
trips2020 = data20extracted['TOPLAM SEFER ADETI'].sum()
print("Total number of trips in 2020 is",trips2020)

Total number of trips in 2020 is 193669


In [42]:
trips2021 = data21extracted['TOPLAM SEFER ADETI'].sum()
print("Total number of trips in 2021 is",trips2021)

Total number of trips in 2021 is 177882


# Q2.5
## Finding the busiest station
To find the busiest station in a given year, I used **.max()** to find the maximum value of 'TOPLAM SEFER ADETI'. Now we know the maximum number of trips, but which station (or route) does this number belong to? data20extracted['TOPLAM SEFER ADETI'] == busiest20 gives a series of True/False values in which the row with that number is labelled True. With data20extracted[data20extracted['TOPLAM SEFER ADETI'] == busiest20] I get a data frame whose
* [0] element is the route
* [1] element is the number of trips

Using **.iloc** I obtained the [0] and [1] elements and printed them. I repeated this process for each year.

In [43]:
busiest20 = data20extracted['TOPLAM SEFER ADETI'].max()
a = data20extracted[data20extracted['TOPLAM SEFER ADETI'] ==busiest20]
st = a.iloc[0,0]
no = a.iloc[0,1]
print("The busiest station in 2020 is",st,"with",no,"trips")

The busiest station in 2020 is EMÝNÖNÜ - ÜSKÜDAR with 28441 trips


In [44]:
busiest21 = data21extracted['TOPLAM SEFER ADETI'].max()
b = data21extracted[data21extracted['TOPLAM SEFER ADETI'] ==busiest21]
st = b.iloc[0,0]
no = b.iloc[0,1]
print("The busiest station in 2021 is",st,"with",no,"trips")

The busiest station in 2021 is EMİNÖNÜ-ÜSKÜDAR with 23854 trips
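
The two-step lookup (take the maximum, then filter with a boolean mask) can also be written with `idxmax`, which returns the index label of the largest value. The sketch below uses three rows copied from the 2021 table above; `routes` and `top` are names introduced only for this example.

```python
import pandas as pd

# Three rows copied from the 2021 output above.
routes = pd.DataFrame({
    'GUZERGAH': ['BEŞİKTAŞ-KADIKÖY', 'EMİNÖNÜ-ÜSKÜDAR', 'KADIKÖY-KARAKÖY'],
    'TOPLAM SEFER ADETI': [23658, 23854, 6168],
})

# idxmax gives the row label of the maximum; .loc then fetches that whole row.
top = routes.loc[routes['TOPLAM SEFER ADETI'].idxmax()]
print(top['GUZERGAH'], top['TOPLAM SEFER ADETI'])
```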


*Side note: I have realised that this data set contains the _route_ but I have printed it out as a _station_ like the question asked me to.*

# Question 3

Using a different [dataset](https://data.ibb.gov.tr/dataset/istanbul-deniz-iskeleleri-yolcu-sayilari) again from Istanbul Municipality on Istanbul Deniz Isletmeleri: 

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Find out the busiest station in the years 2020 and 2021,
3. Repeat the same calculation monthly: find the busiest stations for each month.
4. Does your calculation of busiest stations agree with the calculation you made in Question #2? Explain.


Just like the second question the data I am going to work on is in a csv file so I import pandas as pd and csv. This helps me read my data with pd.read_csv

In [45]:
import pandas as pd
import csv

# Q3.1
## Understanding the data
**This data set contains the year, the month, the dock names in Istanbul, the names of the establishments responsible for those docks, and the number of passengers. The data belongs to the year 2021, and the months are given as numbers.**

# Q3.2
## Finding the busiest station
I begin by reading my data and declaring that the fields are separated by ";", which is how the original data set was created. When I ran this I realized that certain letters looked weird, so I used **encoding="ISO-8859-1"**.

In [46]:
data = pd.read_csv(r'C:\Users\sinem\Downloads\istanbul-deniz-iskeleleri-yolcu-saylar.csv',sep=';',encoding="ISO-8859-1")
data

,Yil,Ay,Otorite Adi,Istasyon Adi,Yolcu Sayisi
0,2021,3,Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ...,BEYKOZ,5076
1,2021,3,Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ...,YENIKOY,5347
2,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,BESIKTAS,106334
3,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,KABATAS,24
4,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,USKUDAR,94200
...,...,...,...,...,...
656,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Eminönü,55387
657,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Kadýköy Balon,40680
658,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Kadýköy Çayýrbaþý,69443
659,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Karaköy,55098
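
Some letters in the output still look odd (for example "Kadýköy"). One plausible explanation, not confirmed by the data source, is that the file is encoded in ISO-8859-9 (the Turkish variant of Latin-1), where the bytes that ISO-8859-1 shows as Þ, Ý, ý stand for Ş, İ, ı. The sketch below re-decodes two strings from the outputs above under that assumption.

```python
# Strings as they appear in the outputs above (decoded as ISO-8859-1).
garbled_route = "BEÞÝKTAÞ"
garbled_station = "Kadýköy"

# Go back to the raw bytes, then decode them as ISO-8859-9 (an assumption).
fixed_route = garbled_route.encode("iso-8859-1").decode("iso-8859-9")
fixed_station = garbled_station.encode("iso-8859-1").decode("iso-8859-9")

print(fixed_route, fixed_station)
```

Under the same assumption, passing `encoding="ISO-8859-9"` to `pd.read_csv` would yield the repaired names directly.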


To find the busiest station I found the maximum number of passengers in the data set, and the next step was to discover which station has that number. Note that this picks the station with the largest single record (one station, one operator, one month) rather than the largest yearly total. Then I printed my result, reading the values that could change in the future with **.iloc** (for example, in this data set the year is fixed at 2021, but the station or the passenger number could change later on).

In [47]:
busiest = data['Yolcu Sayisi'].max()
a = data[data['Yolcu Sayisi'] == busiest]
print("The busiest station in 2021 is",a.iloc[0,3],"with",busiest,"passengers")

The busiest station in 2021 is BESIKTAS with 757374 passengers
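
The maximum of 'Yolcu Sayisi' identifies the largest single row, which is not necessarily the station with the most passengers over the whole year. A sketch of the yearly version sums the passengers per station first and then takes the maximum. The `toy` frame is invented so that the two answers differ; only the column names come from the data set.

```python
import pandas as pd

# Invented numbers: station A has the largest single month, B the largest total.
toy = pd.DataFrame({
    'Ay': [3, 3, 4, 4],
    'Istasyon Adi': ['A', 'B', 'A', 'B'],
    'Yolcu Sayisi': [100, 60, 10, 70],
})

largest_row_station = toy.loc[toy['Yolcu Sayisi'].idxmax(), 'Istasyon Adi']

# Sum over all months (and operators) per station, then pick the largest total.
totals = toy.groupby('Istasyon Adi')['Yolcu Sayisi'].sum()
busiest_station = totals.idxmax()

print(largest_row_station, busiest_station, totals[busiest_station])
```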


# Q3.3
## Finding the busiest stations for each month
The 'Ay' column of the original data has many recurring values (many 3's, many 4's, etc.). To obtain a clean series of months I used **.drop_duplicates()**. After that, one of each month number was left, but the index was no longer consecutive: instead of 0, 1, 2, 3 it went 0, 70, 140, and so on. Using **.reset_index(drop=True,inplace=True)** solved that problem.

In [48]:
months = data['Ay']
months = months.drop_duplicates()
months.reset_index(drop=True,inplace=True)

Now that I have a list of months, I wrote a for loop that takes each month and searches for it in the month column of my original data. For every row whose month matches, the passenger number is appended to a list **x**. When the inner loop has gone through all rows, the maximum passenger number in **x** is appended to the list **y**. After that the process starts again for the next month with an empty x.

This code works because the months in the data set are in order: the counter i only advances on a match, so it assumes that all rows of a month sit next to each other. If the rows for month 3 were scattered in different places instead, I would have to write new code.

In [49]:
station = data['Istasyon Adi']
no = data['Yolcu Sayisi']
i = 0
y = []
for k in months:
    x = []
    for n in data['Ay']:
        if n == k:
            x.append(no[i])
            i = i + 1
    y.append(max(x))

In [50]:
y

[106334, 274984, 205662, 488048, 590792, 601089, 633355, 757374, 223286]

Now I know the maximum passenger number for each month. What I tell the program to do is go to the original data set and find the row that has that exact number and that exact month. I could have searched by the passenger number alone, but I added the month as well to make sure the result is unambiguous.

Then I print them.

In [51]:
for s in range(len(months)):
    l = data[(data["Ay"]==months[s]) & (data["Yolcu Sayisi"]==y[s])]
    print("The busiest station in", l.iloc[0,0] ,"at month",l.iloc[0,1],"is",l.iloc[0,3],"with",l.iloc[0,4],"passengers!")

The busiest station in 2021 at month 3 is BESIKTAS with 106334 passengers!
The busiest station in 2021 at month 4 is BESIKTAS with 274984 passengers!
The busiest station in 2021 at month 5 is BESIKTAS with 205662 passengers!
The busiest station in 2021 at month 6 is BESIKTAS with 488048 passengers!
The busiest station in 2021 at month 7 is BESIKTAS with 590792 passengers!
The busiest station in 2021 at month 8 is BESIKTAS with 601089 passengers!
The busiest station in 2021 at month 9 is BESIKTAS with 633355 passengers!
The busiest station in 2021 at month 10 is BESIKTAS with 757374 passengers!
The busiest station in 2021 at month 11 is BESIKTAS with 223286 passengers!
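
The monthly search can also be expressed with `groupby`, which does not depend on the rows being ordered by month. The sketch below first sums the passengers per month and station (a station may appear once per operator) and then keeps the row with the largest total in each month. The `toy` frame is invented; only the column names come from the data set.

```python
import pandas as pd

# Invented numbers; station B appears twice in month 3 (two operators).
toy = pd.DataFrame({
    'Ay': [3, 3, 3, 4, 4],
    'Istasyon Adi': ['A', 'B', 'B', 'A', 'B'],
    'Yolcu Sayisi': [100, 60, 50, 90, 70],
})

# Total passengers per (month, station).
monthly = toy.groupby(['Ay', 'Istasyon Adi'], as_index=False)['Yolcu Sayisi'].sum()

# For each month, keep the row whose total is the largest.
winners = monthly.loc[monthly.groupby('Ay')['Yolcu Sayisi'].idxmax()]
print(winners)
```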


# Q3.4
## Comparing results
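The busiest station (route) in Q2 ended up being Eminonu-Uskudar in both years, whereas in Q3 the busiest station ended up being Besiktas for every month. So the results don't agree. Keep in mind that Q2 counts trips per route while Q3 counts passengers per station, so the two data sets do not measure the same thing.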
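
One way to bring Q2 closer to Q3 is to turn route counts into station counts by crediting each trip to every stop named in the route. The sketch below does this for four rows copied from the 2021 table in Q2. It rests on two assumptions: the stops are separated by "-", and a trip counts once for each stop on its route. Because it uses only a subset of the rows, the result illustrates the method and is not the answer for the full data set.

```python
from collections import Counter

# Four rows copied from the 2021 route table in Q2 (a subset, for illustration).
routes = {
    'BEŞİKTAŞ-KADIKÖY': 23658,
    'EMİNÖNÜ-ÜSKÜDAR': 23854,
    'EMİNÖNÜ-KADIKÖY': 18298,
    'KADIKÖY-KARAKÖY': 6168,
}

trips_per_station = Counter()
for route, trips in routes.items():
    # Credit the route's trips to every stop it names.
    for stop in route.split('-'):
        trips_per_station[stop.strip()] += trips

busiest, count = trips_per_station.most_common(1)[0]
print(busiest, count)
```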