<div style="float:left">
    <h1 style="width:450px">Practical 4: Object-Oriented Programming</h1>
    <h2 style="width:450px">Getting to grips with Functions &amp; Packages</h2>
</div>
<div style="float:right"><img width="100" src="https://github.com/jreades/i2p/raw/master/img/casa_logo.jpg" /></div>

<div style="border: dotted 1px rgb(156,121,26); padding: 10px; margin: 5px; background-color: rgb(255,236,184); color: rgb(156,121,26)"><i>Note</i>: You should download this notebook from GitHub and then save it to your own copy of the repository. I'd suggest adding it (<tt>git add ...</tt>) right away and then committing (<tt>git commit -m "Some message"</tt>). Do this again at the end of the class and you'll have a record of everything you did, then you can <tt>git push</tt> it to GitHub.</div>

## Revisiting Task 4. Why 'Obvious' is Not Always 'Right'

Task 4 in Practical 3 is hard, especially coming at the end of an already challenging practical. So I want to provide _another_ chance for the concepts to bed in before we use them as part of our exploratory work with the InsideAirbnb sample.

##### A Dictionary of Lists to the Rescue

Remember, if we don't really care about column order (and why would we, on one level?), then a dictionary of lists would be a nice way to handle things. And why should we care about column order? With our CSV files above we already saw what a pain it was to fix things when the layout of the columns changed from one data set to the next. If, instead, we can just reference the 'Description' column in the data set then it doesn't matter where that column actually is. Why is that? 

Well, here are four rows of 'data' for city sizes organised by _row_:

In [49]:
myData = [
    ['id', 'Name', 'Rank', 'Longitude', 'Latitude', 'Population'], 
    ['1', 'Greater London', '1', '-18162.92767', '6711153.709', '9787426'], 
    ['2', 'Greater Manchester', '2', '-251761.802', '7073067.458', '2553379'], 
    ['3', 'West Midlands', '3', '-210635.2396', '6878950.083', '2440986']
]

# What cities are in the data set?
col = myData[0].index('Name')
cities = []
for i in range(1,len(myData)):
    cities.append(myData[i][col])
print(", ".join(cities))

Greater London, Greater Manchester, West Midlands


Compare that code to how it works as a dictionary of lists organised by _column_:

In [50]:
myData = {
    'id'         : [0, 1, 2, 3, 4, 5],
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham','Edinburgh','Inverness','Lerwick'],
    'Rank'       : [1, 2, 3, 4, 5, 6],
    'Longitude'  : [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
    'Latitude'   : [51.507, 53.479, 52.480, 55.953, 57.478, 60.155],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}

# What cities are in the data set?
print(", ".join(myData['Name']))   

# Python's join() method joins the elements of a sequence into a
# new string using the specified separator: str.join(sequence)

Greater London, Greater Manchester, Birmingham, Edinburgh, Inverness, Lerwick


In [51]:
print(myData['Name'])

['Greater London', 'Greater Manchester', 'Birmingham', 'Edinburgh', 'Inverness', 'Lerwick']


**See how even basic questions like "Is Edinburgh in our list of data?" are suddenly easy to answer?** <br> We no longer need to loop over the entire data set in order to find one data point. In addition, we know that everything in the 'Name' column will be a string, and that everything in the 'Longitude' column is a float, while the 'Population' column contains integers. So that's made life easier already. But let's test this out and see how it works.
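To test that claim, here is a minimal sketch using Python's `in` operator on a cut-down copy of the data. The variable name `cities` is only for illustration and isn't used elsewhere in the practical.

```python
# A cut-down copy of the dictionary-of-lists used above
cities = {
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham',
                    'Edinburgh', 'Inverness', 'Lerwick'],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}

# Membership test on one column: we don't write any loop ourselves
has_edinburgh = 'Edinburgh' in cities['Name']
has_glasgow   = 'Glasgow' in cities['Name']
print(has_edinburgh, has_glasgow)

# Every value in a column shares a type, and we can check that too
all_ints = all(isinstance(p, int) for p in cities['Population'])
print(all_ints)
```

Python still scans the list behind the scenes, but it only has to look at _one_ column rather than every row of the data set.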

Now let's look at what you can do with this... but first we need to import one _more_ package that you're going to see a _lot_ over the rest of term: `numpy` (Numerical Python), which is used _so_ much that most people simply refer to it as `np`. This is a _huge_ package in terms of features, but right now we're interested only in the basic arithmetic functions: `mean`, `max`, and `min`.

There's a _lot_ of content to process in the code below, so do _not_ rush blindly on if it is confusing. 

<div style="border: dotted 1px rgb(156,121,26); padding: 10px; margin: 5px; background-color: rgb(255,236,184)"><i>Stop!</i>: Look closely at what is going on. Try pulling it apart into pieces and then reassembling it. Start with the bits that you understand and then <i>add</i> complexity.</div>

We'll go through each one in turn, but they nearly all work in the same way and the really key thing is that you'll notice that we no longer have any loops (which are slow) just `index` or `np.<function>` (which is _very_ fast). 

#### The Population of Manchester

The code can look pretty daunting, so let's break it down into parts:

In [16]:
myData = {
    'id'         : [0, 1, 2, 3, 4, 5],
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham','Edinburgh','Inverness','Lerwick'],
    'Rank'       : [1, 2, 3, 4, 5, 6],
    'Longitude'  : [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
    'Latitude'   : [51.507, 53.479, 52.480, 55.953, 57.478, 60.155],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}


city = 'Greater Manchester'
pop = myData['Population'][ myData['Name'].index(city) ]
print(f"Manchester's population is {pop}") # Notice how 'f-strings' work!

Manchester's population is 2705000


Remember that this is a dictionary-of-lists (DoL). So:
```python
myData['Population']    # Returns a list of population values
myData['Population'][0] # Returns the first element of that list
```
Does **that part** make sense?

---

Now, to the second part: we know that Manchester is at index position 1 in the list, but we don't want to hard-code this for every city, so we need to replace `0` with code that will look up the index of a city, and we can only get that by looking in `myData['Name']`:
```python
myData['Name'].index('Greater Manchester')
```

Here we look in the dictionary for the key `Name` and find that that's _also_ a list (`['Greater London','Greater Manchester',...]`). All we're doing here is asking Python to find the index of 'Greater Manchester' for us in that list. 

And `myData['Name'].index('Greater Manchester')` gives us back a `1`, so _instead_ of just writing a `1` into `myData['Population'][1]` we can replace `1` with `myData['Name'].index('Greater Manchester')`! Notice the complete _absence_ of a for loop?
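If you find yourself repeating this pattern, one option is to wrap it in a small function. The sketch below is only an illustration: `lookup` and its parameter names are made up for this example, and the data is a cut-down copy of `myData`. It also shows that `.index()` needs an _exact_ match.

```python
def lookup(data, key_col, key, value_col):
    """Hypothetical helper: return the value in `value_col` that sits at
    the same position as `key` in `key_col`."""
    idx = data[key_col].index(key)   # position of the key in its own column
    return data[value_col][idx]      # same position in the other column

myData = {
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham'],
    'Population' : [9787426, 2705000, 1141816],
}

pop = lookup(myData, 'Name', 'Greater Manchester', 'Population')
print(pop)

# 'Manchester' on its own is not in the list, so .index() raises a ValueError
try:
    lookup(myData, 'Name', 'Manchester', 'Population')
    found_partial = True
except ValueError as e:
    found_partial = False
    print(f"Lookup failed: {e}")
```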

#### The Easternmost City

Because we are dealing with numeric values now we can also do useful things much more quickly like finding the first part of, say, a _bounding box_ (the East value).

In [20]:
myData = {
    'id'         : [0, 1, 2, 3, 4, 5],
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham','Edinburgh','Inverness','Lerwick'],
    'Rank'       : [1, 2, 3, 4, 5, 6],
    'Longitude'  : [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
    'Latitude'   : [51.507, 53.479, 52.480, 55.953, 57.478, 60.155],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}


city = myData['Name'][ myData['Longitude'].index( max(myData['Longitude']) ) ]
print(f"The easternmost city is: {city}")

The easternmost city is: Greater London


Again, we need to break this down into parts in order to understand it:
1. We need to find the maximum _value_ of the longitude in the data set.
2. We need to find the _index_ of this value.
3. We use that index value to look up the name of the city.

So, the pieces in code... 
```python
myData['Longitude']      # The longitude data values
myData['Longitude'][0]   # The first element of the list
max(myData['Longitude']) # The maximum value in the list
myData['Longitude'].index(...) # Search for a value in the list
```
Does **that part** make sense?

---
We can then think our way through this as we might with a maths equation and substitution:
```python
myData['Longitude'].index( max(myData['Longitude']) ) # The index of the maximum value of the list
myData['Name'][0] # The first element in the city list

# You can use whitespace to make this more legible
myData['Name'][
    myData['Longitude'].index( 
        max(myData['Longitude']) 
    )
]

# Or, once you're more comfortable with code:
city = myData['Name'][ myData['Longitude'].index( max(myData['Longitude']) ) ]
```
Does **that part** make sense?

---

So to explain this in three steps, what we're doing is:
* Finding the maximum value in the Longitude column (we know there must be one, but we don't know what it is!),
* Finding the index (position) of that maximum value in the Longitude column (now that we know what the value is!),
* Using that index to read a value out of the Name column.

I _am_ a geek, but that's pretty cool, right? In one line of code we managed to quickly find out where the data we needed was even though it involved three discrete steps. Think about how much work you'd have to do if you were still thinking in _rows_, not _columns_!
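The same three steps extend naturally to the rest of the bounding box. This is a sketch only: the variable names (`west`, `bbox`, and so on) are my own, and the data is a cut-down copy of `myData`.

```python
myData = {
    'Name'      : ['Greater London', 'Greater Manchester', 'Birmingham',
                   'Edinburgh', 'Inverness', 'Lerwick'],
    'Longitude' : [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
    'Latitude'  : [51.507, 53.479, 52.480, 55.953, 57.478, 60.155],
}

# Step 1: the extreme values of each column
west,  east  = min(myData['Longitude']), max(myData['Longitude'])
south, north = min(myData['Latitude']),  max(myData['Latitude'])
bbox = (west, south, east, north)
print(bbox)

# Steps 2 and 3: value -> index -> name, exactly as for the easternmost city
westernmost  = myData['Name'][ myData['Longitude'].index(west) ]
northernmost = myData['Name'][ myData['Latitude'].index(north) ]
print(westernmost, northernmost)
```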

#### The Location of Lerwick

This is 'just' a variation on a theme since we're still using the same concepts of `index` and lists. What makes it hard is that there looks to be a _lot_ of code on one line:

In [None]:
city = "Lerwick"
print(f"The town of {city} can be found at " + 
      f"{abs(myData['Longitude'][myData['Name'].index(city)])}ºW, {myData['Latitude'][myData['Name'].index(city)]}ºN")

But always remember that you can rewrite this using whitespace and concatenation to make it easier for a human to read:

In [None]:
city = "Lerwick"
print(f"The town of {city} can be found at " + 
      f"{abs(myData['Longitude'][myData['Name'].index(city)])}ºW, " +
      f"{myData['Latitude'][myData['Name'].index(city)]}ºN")

Or, you could even work it out this way first and _then_ combine the code as above:

In [52]:
city = "Lerwick"
lon  = abs(
    myData['Longitude'][
        myData['Name'].index(city)
    ]
)
lat  = myData['Latitude'][
    myData['Name'].index(city)
]

print(f"The town of {city} can be found at " + 
      f"{lon}ºW, " +
      f"{lat}ºN")

The town of Lerwick can be found at 1.145ºW, 60.155ºN


Notice that because everything sits inside the parentheses of `print(...)`, Python carries on reading to the next line as part of the same command; ending each line with a `+` just makes it obvious to a human that there is more to come. That's a handy way to make your code a little easier to read! The same goes for formatting a list: if it's getting a little long then you can *also* break the line after a `,` inside the square brackets!
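Here is a small sketch of how line continuation behaves. What lets Python keep reading is the open bracket, so the same trick works without `print` and for lists. The values are Lerwick's coordinates from above, and the names `msg`, `msg2` and `coords` are just for illustration.

```python
city = "Lerwick"
lon, lat = -1.145, 60.155

# Inside (...) Python keeps reading across line breaks, so each
# trailing + is just ordinary string concatenation.
msg = (f"The town of {city} can be found at " +
       f"{abs(lon)}ºW, " +
       f"{lat}ºN")
print(msg)

# Adjacent string literals inside brackets are joined automatically,
# so the + can even be dropped.
msg2 = (f"The town of {city} can be found at "
        f"{abs(lon)}ºW, "
        f"{lat}ºN")

# Lists work the same way: the [...] is what allows the line breaks.
coords = [
    lon,
    lat,
]
print(msg == msg2, coords)
```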

##### Recap of f-Strings

In case you're rusty on how f-strings (`f"<some text here>"`) work, the first one will help you to make sense of the second: f-strings allow you to 'interpolate' code directly into a string rather than having to have lots of `str(x) + " some text " + str(y)`. You can write `f"{x} some text {y}"` and Python will automatically replace `{x}` with the _value of `x`_ and `{y}` with the _value of `y`_. 

So `f"The town of {city} can be found at "` becomes `"The town of Lerwick can be found at "` because `{city}` is replaced by the value of the variable `city`. This makes for code that is easier for humans to read and so I'd consider that a good thing.

##### Breaking it Down (Again)

The second f-string _looks_ hard because there's a _lot_ of code there. But, again, if we start with what we recognise then it gets just a little bit more manageable... Also, it stands to reason that the only difference between the two outputs is that one asks for the 'Longitude' and the other for the 'Latitude'. So if you can make sense of one you have _automatically_ made sense of the other and don't need to work it all out.

Let's start with a part that you might recognise:
```python
myData['Name'].index(city)
```
Does **that part** make sense?

---

You've _got_ this. This is just asking Python to work out the index of Lerwick (because `city = 'Lerwick'`). So it's a number: 5 in this case. And we can then think, 'OK, so what does this return?':
```python 
myData['Longitude'][5]
```
And the answer is `-1.145`. That's the Longitude of Lerwick! There's just _one_ last thing: notice that we're talking about degrees West here. So the answer isn't a negative (because negative West degrees would be _East_!), it's the _absolute_ value. And that is the final piece of the puzzle: `abs(...)` gives us the absolute value of a number!

Does **that part** make sense?

---

You could have [found `abs` yourself using Google](https://lmgtfy.app/?q=absolute+value+python).
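The code above hard-codes 'W' because all of our cities are west of Greenwich. As a sketch of how you might handle both hemispheres, here is a hypothetical helper (`format_lon` is my own name, not something from the practical) that picks the suffix from the sign and uses `abs` for the number:

```python
def format_lon(lon):
    """Hypothetical helper: format a signed longitude with an E/W suffix."""
    hemisphere = 'W' if lon < 0 else 'E'   # negative longitudes are West
    return f"{abs(lon)}º{hemisphere}"

lerwick = format_lon(-1.145)
paris   = format_lon(2.352)
print(lerwick, paris)
```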

#### The Average City Size

Here we're going to 'cheat' a little bit: rather than writing our own function, we're going to import a package and use someone _else's_ function. The `numpy` package contains a _lot_ of useful functions that we can call on (if you don't believe me, add "`dir(np)`" on a new line after the `import` statement), and one of them calculates the average of a list or array of data.

In [22]:
import numpy as np

mean = np.mean(myData["Population"])
print(f"The mean population is: {mean}")

The mean population is: 2435442.5


You _could_ also write this like:
```python
print(f"The mean population is: {np.mean(myData['Population'])}")
```
There's no 'right' way to write your code here: putting it all on one line and not saving the result to a temporary variable called `mean` is slightly faster, but if you were going to use the mean to do other things (e.g. standardise the data) then saving it to a variable makes it a bit clearer what you're doing.

#### Standardising City Sizes

To give you a sense of how scalable this approach to data is, check out this neat little trick for working out z-scores for city sizes:

In [23]:
# Use numpy functions to calculate mean and standard deviation
mean = np.mean(myData['Population'])
std  = np.std(myData['Population'])
print(f"City distribution has sample mean {mean} and sample standard deviation of {std:7.2f}.") 
# The 7 is optional: digits to the left of the decimal point are always kept

City distribution has sample mean 2435442.5 and sample standard deviation of 3406947.93.


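`numpy` gives us a way to calculate the mean and standard deviation _quickly_ and without having to reinvent the wheel. Note that `np.std` calculates the _population_ standard deviation by default (`ddof=0`); pass `ddof=1` if you want the sample version. The other potentially new thing here is `{std:7.2f}`. This is about [string formatting](https://www.w3schools.com/python/ref_string_format.asp) and the main thing to recognise is that this means 'format this float so that it takes up at least 7 characters in total (including the decimal point), with 2 digits to the right of the decimal'. Numbers that need more room are never truncated, which is why the output above is wider than 7 characters. The link I've provided uses the slightly older approach of `<str>.format()` but the formatting approach is the same.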
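A few quick experiments make the format specification easier to see. This is just a sketch: the square brackets are there only to make any padding visible, and the names `wide`, `padded` and `grouped` are my own.

```python
std = 3406947.93

# Width 7 is a *minimum*: a longer number is printed in full
wide = f"[{std:7.2f}]"

# A shorter number is padded on the left with spaces up to 7 characters
padded = f"[{3.14159:7.2f}]"

# A comma in the spec adds thousands separators; .0f rounds to no decimals
grouped = f"[{std:,.0f}]"

print(wide)
print(padded)
print(grouped)
```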

Does **that part** make sense?

---

Now we're going to see how the code `[x for x in list]` gives us a way to apply an operation (converting to string, subtracting a value, etc.) to every item in a list without writing out a full for loop. This basically gives us a one-line way to avoid writing:
```python
rs = []
for x in myData['Population']:
    rs.append((x-mean)/std)
```
So here code in the `for` loop is applied and the result automatically added to the output list.
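If you want to convince yourself that the two forms really are equivalent, here is a tiny sketch using made-up numbers (`values` is not part of the city data):

```python
values = [2, 4, 6, 8]
mean = sum(values) / len(values)

# Long form: create an empty list, then append inside a for loop
centred_loop = []
for x in values:
    centred_loop.append(x - mean)

# List comprehension: the same operation on one line
centred_comp = [x - mean for x in values]

print(centred_loop)
print(centred_comp)
print(centred_loop == centred_comp)
```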

In [53]:
myData = {
    'id'         : [0, 1, 2, 3, 4, 5],
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham','Edinburgh','Inverness','Lerwick'],
    'Rank'       : [1, 2, 3, 4, 5, 6],
    'Longitude'  : [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
    'Latitude'   : [51.507, 53.479, 52.480, 55.953, 57.478, 60.155],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}

mean = np.mean(myData['Population'])
std  = np.std(myData['Population'])

rs = [(x - mean)/std for x in myData['Population']]
myData['Std. Population'] = rs
print(myData['Std. Population'])

[2.1579383252868527, 0.0791199354729932, -0.3797024575689938, -0.45025269939207097, -0.6942995760276591, -0.7128035277711219]


In [54]:
myData

{'id': [0, 1, 2, 3, 4, 5],
 'Name': ['Greater London',
  'Greater Manchester',
  'Birmingham',
  'Edinburgh',
  'Inverness',
  'Lerwick'],
 'Rank': [1, 2, 3, 4, 5, 6],
 'Longitude': [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
 'Latitude': [51.507, 53.479, 52.48, 55.953, 57.478, 60.155],
 'Population': [9787426, 2705000, 1141816, 901455, 70000, 6958],
 'Std. Population': [2.1579383252868527,
  0.0791199354729932,
  -0.3797024575689938,
  -0.45025269939207097,
  -0.6942995760276591,
  -0.7128035277711219]}

This should output:
```
[2.1579383252868527, 0.0791199354729932, -0.3797024575689938, -0.45025269939207097, -0.6942995760276591, -0.7128035277711219]
```

In [30]:
print("City name: " + ", ".join( myData['Name'] ))
print("Raw population: " + ", ".join( [str(x) for x in myData['Population']] ))
print("Standardised population: " + ", ".join( [f"{x:4.3f}" for x in myData['Std. Population']] ))

City name: Greater London, Greater Manchester, Birmingham, Edinburgh, Inverness, Lerwick
Raw population: 9787426, 2705000, 1141816, 901455, 70000, 6958
Standardised population: 2.158, 0.079, -0.380, -0.450, -0.694, -0.713


This is where our new approach really comes into its own: because all of the population data is in one place (a.k.a. a _series_ or column), we can just throw the whole list into the `np.mean` function rather than having to use all of those convoluted loops and counters. Simples, right? 

No, not _simple_ at all conceptually, but we've come up with a way to _make_ it simple _as code_.
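As a cross-check on the numbers above, the same column-wise idea also works with Python's built-in `statistics` module. This is a sketch rather than a replacement for `numpy`. I'm using `pstdev` (the _population_ standard deviation) because that is what `np.std` calculates by default.

```python
import statistics

population = [9787426, 2705000, 1141816, 901455, 70000, 6958]

# Pass the whole column to one function: no loops or counters
mean = statistics.fmean(population)
std  = statistics.pstdev(population)   # population std, like np.std's default

z = [(x - mean) / std for x in population]

print(f"Mean: {mean}; standard deviation: {std:.2f}")
print(", ".join(f"{x:.3f}" for x in z))
```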

#### Brain Teaser

Why not have a stab at writing the code to print out the _4th most populous_ city? This can _still_ be done on one line, though you might want to start by breaking the problem down:
1. How do I find the _4th_ largest value in a list?
2. How do I find the _index_ of the 4th largest value in a list?
3. How do I use that to look up the name associated with that index?

You've already done \#2 and \#3 above so you've _solved_ that problem. If you can solve \#1 then the rest should fall into place.

<div style="border: dotted 1px green; padding: 10px; margin: 5px; background-color: rgb(249,255,249);"><i>Hint</i>: you don't want to use <tt>&lt;list&gt;.sort()</tt> because that will sort your data <i>in place</i> and break the link between the indexes across the 'columns'; you want to research the function <tt>sorted(&lt;list&gt;)</tt> where <tt>&lt;list&gt;</tt> is the variable that holds your data and <tt>sorted(...)</tt> just returns whatever you pass it in a sorted order <i>without</i> changing the original list. You'll see why this matters if you get the answer... otherwise, wait a few days for the answers to post.</div>
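To see the difference the hint is describing, here is a small sketch using three of the longitudes from earlier, so as not to give the game away. The variable names are just for illustration.

```python
names = ['Greater London', 'Greater Manchester', 'Birmingham']
lons  = [-0.128, -2.245, -1.903]

# sorted() returns a *new* list; lons itself is untouched
ordered = sorted(lons)
print(ordered, lons)

# .sort() changes the list in place (and returns None)
broken = list(lons)   # work on a copy so we can compare afterwards
result = broken.sort()
print(result, broken)

# After an in-place sort, position 0 no longer lines up with names[0]
print(names[0], lons[0], broken[0])
```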

In [None]:
sorted(myData['Population'], reverse=True)

In [36]:
# Print out the name of the 4th most populous city-region
myData = {
    'id'         : [0, 1, 2, 3, 4, 5],
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham','Edinburgh','Inverness','Lerwick'],
    'Rank'       : [1, 2, 3, 4, 5, 6],
    'Longitude'  : [-0.128, -2.245, -1.903, -3.189, -4.223, -1.145],
    'Latitude'   : [51.507, 53.479, 52.480, 55.953, 57.478, 60.155],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}

# Find the fourth largest value
fourth = sorted(myData['Population'], reverse=True)[3]
print(fourth)

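# sort() is a list method, whereas sorted() can sort any iterable.

# The list's sort() method modifies the existing list in place and
# returns None, whereas the built-in sorted() function returns a
# new list and leaves the original untouched.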


901455


In [37]:
# Find the index of the fourth largest value
idx = myData['Population'].index(fourth)
print(idx)

# Find the city associated with that value
city = myData['Name'][idx]

# And output
print("The fourth most populous city is: " + str(city))

3
The fourth most populous city is: Edinburgh


The answer is Edinburgh.
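For completeness, here is a sketch of how those three steps collapse onto the one line promised at the start, using a cut-down copy of the data. One caveat: `.index()` returns the _first_ match, so if two cities had exactly the same population you would always get the first of them.

```python
myData = {
    'Name'       : ['Greater London', 'Greater Manchester', 'Birmingham',
                    'Edinburgh', 'Inverness', 'Lerwick'],
    'Population' : [9787426, 2705000, 1141816, 901455, 70000, 6958],
}

# Sort a copy (largest first), take the 4th value, find its index, look up the name
city = myData['Name'][ myData['Population'].index( sorted(myData['Population'], reverse=True)[3] ) ]
print(f"The fourth most populous city is: {city}")
```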

#### Recap!

So the _really_ clever bit in all of this isn't switching from a list-of-lists to a dictionary-of-lists, it's recognising that the dictionary-of-lists is a _better_ way to work _with_ the data that we're trying to analyse and that there are useful functions that we can exploit to do the heavy lifting for us. Simply by changing the way that we stored the data in a 'data structure' (i.e. complex arrangement of lists, dictionaries, and variables) we were able to do away with lots of for loops and counters and conditions, and reduce many difficult operations to something that could be done on one line! 

## Task 1. Creating a Set of Functions

Let's start trying to put this all together by creating a set of functions that will help us to:

1. Download a file from a URL (checking if it has already _been_ downloaded to save bandwidth).
2. Parse it as a CSV file and...
3. Convert it to a Dictionary-of-Lists
4. Perform some simple calculations using the resulting data.

To be honest, there's not going to be much about writing our _own_ objects here, but we will be making use of them and, conceptually, an understanding of objects and classes is going to be super-useful for understanding what we're doing in the remainder of the term!

#### Task 1.1: Start from Existing Code

First, let's be sensibly lazy--we've already written code to read a file ([2020-08-24-sample-listings.csv](https://raw.githubusercontent.com/jreades/fsds/master/data/2020-08-24-sample-listings-simple.csv)) from the Internet and turn it into a list of lists. So I've copy+pasted that into the code block below since we're going to start from this point; however, just to help you check your own understanding, I've removed a few bits and replaced them with `??`. Sorry. 😈

In [137]:
from urllib.request import urlopen
import csv

url = "https://raw.githubusercontent.com/jreades/fsds/master/data/2020-08-24-sample-listings-simple.csv"

urlData = [] # Somewhere to store the data

response = urlopen(url) # Get the data using the urlopen function
csvfile  = csv.reader(response.read().decode('utf-8').splitlines()) # Pass it over to the reader

for row in csvfile:
    urlData.append(row)

print("urlData has " + str(len(urlData)) + " rows and " + str(len(urlData[0])) + " columns.")
print(urlData[-1][:2]) # Check it worked!

urlData has 101 rows and 19 columns.
['40373464', 'Modern, Small Double Private Room']


You should get `urlData has 101 rows and 19 columns.` and a row that looks like this: `['40373464', 'Modern, Small Double Private Room']`.

#### Task 1.2: Getting Organised

Let's take the code above and modify it so that it is:

1. A function that takes two arguments: a URL; and a destination filename.
2. Implemented as a function that checks if a file exists already before downloading it again.

You will find that the `os` module helps here because of its `path` submodule. And you will [need to Google](https://lmgtfy.app/?q=check+if+file+exists+python) how to test if a file exists. I would normally select a StackOverflow link in the results list over anything else because there will normally be an _explanation_ included of why a particular answer is a 'good one'. I also look at which answers got the most votes (not always the same as the one that was the 'accepted answer'). In this particular case, I also found [this answer](https://careerkarma.com/blog/python-check-if-file-exists/) useful.
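If you'd like to experiment before searching, here is a minimal sketch of two of the `os.path` functions you're likely to come across. It works in a temporary directory so that it can't touch your real `data` folder, and the file name `example.csv` is made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    dest   = os.path.join(tmp, 'data', 'example.csv')
    folder = os.path.dirname(dest)

    before = os.path.isfile(dest)        # nothing has been written yet

    os.makedirs(folder, exist_ok=True)   # create the missing 'data' directory
    with open(dest, 'w', encoding='utf-8') as f:
        f.write("id,name\n")

    after         = os.path.isfile(dest)    # now the file is there
    folder_isfile = os.path.isfile(folder)  # a directory is not a file...
    folder_exists = os.path.exists(folder)  # ...but it does exist

print(before, after, folder_isfile, folder_exists)
```

Notice that `os.path.exists` is true for directories as well as files, whereas `os.path.isfile` is only true for files.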

--- 

I would start by setting my inputs:

In [138]:
import os
url = "https://raw.githubusercontent.com/jreades/fsds/master/data/2020-08-24-sample-listings-simple.csv"
out = os.path.join('data','2020-08-24-sample-listings.csv')
out

'data/2020-08-24-sample-listings.csv'

#### Task 1.3: Sketching Out a Function

Then I would sketch out how my function will work using comments. And the simplest thing to start with is checking whether the file has already been downloaded:

In [139]:
import os
url = "https://raw.githubusercontent.com/jreades/fsds/master/data/2020-08-24-sample-listings-simple.csv"
out = os.path.join('data','2020-08-24-sample-listings.csv')

from urllib.request import urlopen
import csv

def get_url(src, dest):
    
    # Check if dest does *not* exist -- that
    # would mean we had to download it!
    if not os.path.isfile(dest):
        print(f"{dest} not found!")
    else:
        print(f"Found {dest}!")
        
get_url(url, out)

Found data/2020-08-24-sample-listings.csv!


#### Task 1.4: Fleshing Out a Function 

I would then flesh out the code that checks if the data has been downloaded and ensure that both my if and else 'branches' return a list that I could then read using the CSV library:

In [140]:
from urllib.request import urlopen
import csv
import os

def get_url(src, dest):
    
    # Check if dest does *not* exist -- that
    # would mean we had to download it!
    if not os.path.isfile(dest):
        print(f"{dest} not found, downloading!")
        
        # Get the data using the urlopen function
        response = urlopen(src) 
        filedata = response.read().decode('utf-8')
        
        # Extract the part of the dest(ination) that is *not*
        # the actual filename--have a look at how 
        # os.path.split works using `help(os.path.split)`
        path = list(os.path.split(dest)[:-1])
        
        # Create any missing directories in dest(ination) path
        # -- os.path.join is the reverse of split (as you saw above)
        # but it doesn't work with lists... so I had to google how 
        # to use the 'splat' operator! os.makedirs creates missing 
        # directories in a path automatically.
        if len(path) >= 1 and path[0] != '':
            os.makedirs(os.path.join(*path), exist_ok=True)
        
        # Write the downloaded text to dest, using the same
        # encoding that we use to read it back in below
        with open(dest, 'w', encoding='utf-8') as f:
            f.write(filedata)
            
    else:
        print(f"Found {dest} locally!")
    
    with open(dest, 'r', encoding='utf-8') as f:
        return f.read().splitlines()
        
# Because both branches end by returning the file's lines,
# it is easy to see what our function is up to.
c = get_url(url, out)


Found data/2020-08-24-sample-listings.csv locally!


In [141]:
print (c)

['id,name,host_id,host_name,host_since,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,maximum_nights,availability_365,number_of_reviews,calculated_host_listings_count', '25339003,"An Amazing 4Bedroom Home, Central London, Sleeps12",191329110,Emily,2018-05-24,51.52865,-0.19998,Entire house,Entire home/apt,12,,4.0,9.0,$226.00,2,1125,23,64,1', '40259218,Large Double Room - Maida Vale,302720259,Mantas,2019-10-16,51.52594000000001,-0.18909,Private room in apartment,Private room,2,,1.0,1.0,$41.00,1,24,365,4,5', '20097666,Zone 1 : Spacious single bedroom in Bayswater,39710946,Thanyawan,2015-07-27,51.51743,-0.18702,Private room in apartment,Private room,1,,1.0,1.0,$35.00,4,1125,0,0,1', '40868766,Large Smart Room 20 minutes walk to Big Ben,300905210,Nadia,2019-10-08,51.48951,-0.11812,Private room in apartment,Private room,2,,1.0,2.0,$55.00,2,1125,230,4,3', '29649371,Large Notting Hill 2BR near Portobello Rd,2165973,Emily & Kirsty,2012-04-18,

<div style="border: dotted 1px rgb(156,121,26); padding: 10px; margin: 5px; background-color: rgb(255,236,184)"><i>Stop!</i> Notice that we don't try to check if the data file contains any useful data! So if you download or create an empty file while testing, you won't necessarily get an error until you try to turn it into data afterwards!</div>
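If you wanted to guard against that, one possible approach is to check the size of the file as well as its existence. This is a sketch only: `has_content` is a hypothetical helper, not part of the practical's code, and 'at least one byte' is a deliberately crude test.

```python
import os
import tempfile

def has_content(path):
    """Hypothetical check: True if `path` is a file containing at least one byte."""
    return os.path.isfile(path) and os.path.getsize(path) > 0

with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, 'empty.csv')
    full  = os.path.join(tmp, 'full.csv')

    open(empty, 'w').close()                  # creates a zero-byte file
    with open(full, 'w', encoding='utf-8') as f:
        f.write("id,name\n1,Test\n")

    empty_ok   = has_content(empty)
    full_ok    = has_content(full)
    missing_ok = has_content(os.path.join(tmp, 'missing.csv'))

print(empty_ok, full_ok, missing_ok)
```

You could then download the file again whenever the check fails, rather than only when the file is missing.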

#### Task 1.5: Read a CSV file into a LoL

Now we've taken care of whether or not the file has already been downloaded, we can focus on the next part of the problem! We have looked at code like this before in the Live sessions.

In [142]:
from urllib.request import urlopen
import csv
import os

# Notice that it doesn't make sense to use `dest` as the 
# parameter name here because we always read *from* a data
# source. Your names can be whatever you want, but they 
# should be logical wherever possible!
def to_lol(lst):
    
    # Rest of code to read file and convert it goes here
    csvdata = []
    
    # This is the same code that you used last week, but 
    # you'll have to rename some vars to get things to
    # work for you here.
    csvfile  = csv.reader(lst)
    for row in csvfile:              
        csvdata.append( row )
    
    # Return list of lists
    return csvdata
        
# Save the CSV-LoL to a new variable
clol = to_lol(c)

In [143]:
print(f"LoL has {len(clol)} rows and {len(clol[0])} columns.")
print(clol[0][:2])
print(clol[-1][:2])

LoL has 101 rows and 19 columns.
['id', 'name']
['40373464', 'Modern, Small Double Private Room']


You should get: 
```
LoL has 101 rows and 19 columns.
['id', 'name']
['40373464', 'Modern, Small Double Private Room']
```

#### Task 1.6: Convert a LoL to a DoL

We're going to assume that the first row of our LoL is always a _header_ (i.e. list of column names). If it's not then this code is going to have problems. A _robust_ function would allow us to specify column names when we create the data structure, but let's not get caught up in that level of detail just yet.

Have a look at Task 2.3 from the Live Coding session to see how to fill this in... Notice that I've also, for the first time, used the docstring support offered by Python. Once this function is working you'll be able to use `help(to_dol)` and get back the docstring help!

In [144]:
def to_dol(lol):
    """
    Converts a list-of-lists (LoL) to a dict-of-lists (DoL)
    using the first element in the LoL to create column names.
    
    :param lol: a list-of-lists where each element of the list represents a row of data
    :returns: a dict-of-lists
    """
    # Create empty dict-of-lists
    ds = {}

    # I had a version of this code that used
    # lol.pop(0) since it made the for loop
    # easier to read. But I changed my mind...
    #
    # Can you think why?
    col_names = lol[0]
    # Write the code to create the keys and empty lists (HINT: for loop)
    for key in col_names:
        ds[key] = []

    # Then values into a list attached to each key
    # and write the code to append values to each list
    for row in lol[1:]:
        for c in range(0,len(col_names)):
            ds[col_names[c]].append(row[c])
            
    return ds

ds = to_dol(clol)

In [145]:
ds["name"][0:5]

['An Amazing 4Bedroom Home, Central London, Sleeps12',
 'Large Double Room - Maida Vale',
 'Zone 1 : Spacious single bedroom in Bayswater',
 'Large Smart Room 20 minutes walk to Big Ben',
 'Large Notting Hill 2BR near Portobello Rd']

In [146]:
print(", ".join(ds.keys()))
print(ds['id'][:2])
print(ds['name'][:2])
print(ds['room_type'][:2])

id, name, host_id, host_name, host_since, latitude, longitude, property_type, room_type, accommodates, bathrooms, bedrooms, beds, price, minimum_nights, maximum_nights, availability_365, number_of_reviews, calculated_host_listings_count
['25339003', '40259218']
['An Amazing 4Bedroom Home, Central London, Sleeps12', 'Large Double Room - Maida Vale']
['Entire home/apt', 'Private room']


The answer should look like:
```
id, name, host_id, host_name, host_since, latitude, longitude, property_type, room_type, accommodates, bathrooms, bedrooms, beds, price, minimum_nights, maximum_nights, availability_365, number_of_reviews, calculated_host_listings_count
['25339003', '40259218']
['An Amazing 4Bedroom Home, Central London, Sleeps12', 'Large Double Room - Maida Vale']
['Entire home/apt', 'Private room']
```
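Task 1.6 assumed that the first row is always a header. As a sketch of what the more _robust_ version mentioned above might look like, here is a variant with an optional `header` parameter. The name `to_dol_with_header` and its signature are assumptions of mine, not part of the practical. It also hints at why slicing is safer than `lol.pop(0)`: the list that was passed in is left unchanged.

```python
def to_dol_with_header(lol, header=None):
    """
    Hypothetical variant of to_dol: converts a list-of-lists to a
    dict-of-lists, using `header` for the column names if it is given
    and the first row of `lol` otherwise.
    """
    if header is None:
        header, rows = lol[0], lol[1:]   # slicing copies; lol is not modified
    else:
        rows = lol

    ds = {name: [] for name in header}   # one empty list per column
    for row in rows:
        for name, value in zip(header, row):
            ds[name].append(value)
    return ds

with_header    = [['id', 'name'], ['1', 'Flat A'], ['2', 'Flat B']]
without_header = [['1', 'Flat A'], ['2', 'Flat B']]

a = to_dol_with_header(with_header)
b = to_dol_with_header(without_header, header=['id', 'name'])
print(a)
print(a == b, len(with_header))
```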

#### Task 1.7: Convert Data Types on DoL

You'll need to investigate the columns yourself in order to see what the appropriate values should be. I would suggest taking the _full_ version of the function where we check what `cdata` is so that we have one function that works for both strings and lists.

Just to help get you started, here are the column names and you can create a `dtype` list to hold the _data type_ for each column.

In [147]:
ds["bathrooms"][0:10]

['', '', '', '', '', '', '', '', '', '']

In [156]:
cols  = ['id', 'name', 'host_id', 'host_name', 
        'host_since', 'latitude', 'longitude', 'property_type', 
        'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 
        'price', 'minimum_nights', 'maximum_nights', 
        'availability_365', 'number_of_reviews', 
        'calculated_host_listings_count']
# One type per column: zip() silently stops at the shorter list,
# so a missing entry here would drop the last column without warning.
dtype = [int, str, int, str,
        str, float, float, str,
        str, int, bool, float, float, 
        str, int, int, int, float, int]

# 'Zips up' these two lists into an iterator of tuples.
# Note that a zip object can only be iterated over once, so
# wrap it in list(...) if you need to reuse the pairs.
for d in zip(cols, dtype):
    # Notice the more advanced formatting here:
    # - `>20` means right-align with up to 20 characters of whitespace; notice the last line!
    # - `d[1].__name__` gives us the name of the data type, rather than a '<class...>' output.
    print(f"Column ({d[0]:>20}) is type: {d[1].__name__}")

Column (                  id) is type: int
Column (                name) is type: str
Column (             host_id) is type: int
Column (           host_name) is type: str
Column (          host_since) is type: str
Column (            latitude) is type: float
Column (           longitude) is type: float
Column (       property_type) is type: str
Column (           room_type) is type: str
Column (        accommodates) is type: int
Column (           bathrooms) is type: bool
Column (            bedrooms) is type: float
Column (                beds) is type: float
Column (               price) is type: str
Column (      minimum_nights) is type: int
Column (      maximum_nights) is type: int
Column (    availability_365) is type: int
Column (   number_of_reviews) is type: float
Column (calculated_host_listings_count) is type: int
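
The two ideas in the cell above can be tried out in isolation. This is a minimal sketch using made-up column names (not the Airbnb columns): it shows that a `zip` object is exhausted after one pass, and that `>20` sets a minimum width rather than truncating longer values.

```python
# Hypothetical column names and types, for illustration only
names = ['id', 'a_very_long_column_name_indeed']
types = [int, str]

pairs = zip(names, types)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left the second time round

print(first_pass)
print(second_pass)

# '>20' right-aligns within at least 20 characters; longer strings overflow
lines = [f"({n:>20}) is {t.__name__}" for n, t in zip(names, types)]
for line in lines:
    print(line)
```

If you need the pairs more than once, store `list(zip(...))` rather than the `zip` object itself.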


Make sure that you understand how this works:

In [149]:
# Convert the raw data to data of the appropriate
# type: 'column data' (cdata) -> 'column type' (ctype)
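def to_type(cdata, ctype):
    # If a string
    if isinstance(cdata, str):
        try:
            # Handle bool separately: bool('False') is True,
            # so compare against the string 'True' instead
            if ctype==bool:
                return cdata=='True'
            else:
                return ctype(cdata)
        except (TypeError, ValueError):
            # Conversion failed (e.g. int('')), so keep the raw value
            return cdata
    
    # Not a string (assume list)
    else: 
        fdata = []
        for c in cdata:
            try:
                if ctype==bool:
                    fdata.append( c=='True' )
                else:
                    fdata.append( ctype(c) )
            except (TypeError, ValueError):
                # Keep values that cannot be converted as-is
                fdata.append( c )
        return fdata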
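
To see why the function treats `bool` as a special case and why it needs a fallback at all, here is a small sketch on made-up strings (not the real columns). It relies only on built-in behaviour: `bool()` of any non-empty string is `True`, and `int()` raises `ValueError` on text it cannot parse.

```python
# bool() of any non-empty string is True, so bool('False') is misleading
naive = bool('False')
compared = ('False' == 'True')

# int() raises ValueError on strings it cannot parse
outcomes = {}
for raw in ['23', '', '9.0']:
    try:
        outcomes[raw] = int(raw)
    except ValueError:
        outcomes[raw] = raw   # keep the original string

print(naive, compared)
print(outcomes)
```

Note that `'9.0'` is not a valid `int` literal, which is one reason a column may need `float` even when its values look like whole numbers.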

In [150]:
# Now apply this! We'll copy the data to a
# new data structure so that we know we're
# not overwriting `ds` until we're sure
# that the code works.
ds2 = {}
for d in zip(cols, dtype):
    ds2[ d[0] ] = to_type(ds[d[0]], d[1])

In [151]:
print(ds2['id'][:3])
print(ds2['host_name'][:3])
print(ds2['beds'][:3])
print(ds2['availability_365'][:3])


[25339003, 40259218, 20097666]
['Emily', 'Mantas', 'Thanyawan']
[9.0, 1.0, 1.0]
[23, 365, 0]


In [165]:
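# Careful: this reports the type of the string literal 'availability_365',
# not of the column. Use type(ds2['availability_365']) to inspect the column.
print(type('availability_365'))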

<class 'str'>


You should get the following:
```
[25339003, 40259218, 20097666]
['Emily', 'Mantas', 'Thanyawan']
[9.0, 1.0, 1.0]
[23, 365, 0]
['2020-03-01', '2020-02-08', '']
```

#### Task 1.8: Checking Basic Functionality

Let's see if our new data structure broadly works by testing out some of our previous operations:

In [152]:
import numpy as np # We'll need this

In [163]:
print(ds['availability_365'])

['23', '365', '0', '230', '0', '0', '365', '0', '0', '340', '246', '363', '89', '3', '0', '365', '180', '88', '365', '64', '87', '0', '83', '0', '52', '365', '0', '0', '0', '0', '237', '357', '0', '90', '329', '0', '11', '0', '356', '34', '0', '338', '0', '172', '365', '0', '336', '0', '0', '0', '2', '85', '91', '75', '327', '122', '89', '0', '33', '329', '168', '178', '0', '5', '179', '169', '305', '365', '318', '55', '338', '0', '364', '0', '365', '270', '0', '0', '170', '298', '0', '83', '364', '0', '0', '0', '0', '252', '363', '0', '268', '0', '80', '0', '0', '90', '0', '365', '52', '0']


In [164]:
print(type(ds['availability_365']))

<class 'list'>


In [169]:
integer_availability_365= map(int, ds['availability_365'])
integer_availability_365_list = list(integer_availability_365)
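
The `map` call above returns a lazy iterator, which is why it is wrapped in `list(...)` before use. A minimal sketch on a made-up three-value sample shows the behaviour, alongside the equivalent list comprehension:

```python
raw = ['23', '365', '0']      # made-up sample, not the real column

lazy = map(int, raw)          # a map object; nothing is converted yet
as_list = list(lazy)          # conversion happens here
leftover = list(lazy)         # the map object is now exhausted

same = [int(x) for x in raw]  # equivalent list comprehension

print(as_list, leftover, same)
```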

In [171]:
print((integer_availability_365_list))

[23, 365, 0, 230, 0, 0, 365, 0, 0, 340, 246, 363, 89, 3, 0, 365, 180, 88, 365, 64, 87, 0, 83, 0, 52, 365, 0, 0, 0, 0, 237, 357, 0, 90, 329, 0, 11, 0, 356, 34, 0, 338, 0, 172, 365, 0, 336, 0, 0, 0, 2, 85, 91, 75, 327, 122, 89, 0, 33, 329, 168, 178, 0, 5, 179, 169, 305, 365, 318, 55, 338, 0, 364, 0, 365, 270, 0, 0, 170, 298, 0, 83, 364, 0, 0, 0, 0, 252, 363, 0, 268, 0, 80, 0, 0, 90, 0, 365, 52, 0]


In [173]:
print(f"Average availability over 365 days is {np.mean(integer_availability_365_list)}")


Average availability over 365 days is 129.15


In [176]:
print(type(integer_availability_365_list))

<class 'list'>


In [178]:
print(ds['minimum_nights'])

['2', '1', '4', '2', '3', '2', '2', '14', '2', '2', '4', '3', '3', '50', '2', '3', '1', '5', '1', '12', '2', '4', '1', '3', '1', '15', '7', '4', '1', '3', '1', '2', '1', '1', '5', '1', '3', '2', '2', '2', '3', '3', '2', '2', '1', '3', '1', '3', '1', '2', '1', '1', '2', '28', '3', '5', '1', '2', '7', '2', '3', '3', '3', '30', '2', '2', '7', '1', '1', '1', '3', '2', '1', '5', '90', '1', '2', '6', '2', '4', '3', '3', '3', '6', '3', '2', '7', '1', '1', '1', '4', '1', '1', '4', '5', '2', '3', '1', '2', '1']


In [181]:
integer_minimum_nights= map(int, ds['minimum_nights'])
integer_minimum_nights_list = list(integer_minimum_nights)
print(integer_minimum_nights_list)

[2, 1, 4, 2, 3, 2, 2, 14, 2, 2, 4, 3, 3, 50, 2, 3, 1, 5, 1, 12, 2, 4, 1, 3, 1, 15, 7, 4, 1, 3, 1, 2, 1, 1, 5, 1, 3, 2, 2, 2, 3, 3, 2, 2, 1, 3, 1, 3, 1, 2, 1, 1, 2, 28, 3, 5, 1, 2, 7, 2, 3, 3, 3, 30, 2, 2, 7, 1, 1, 1, 3, 2, 1, 5, 90, 1, 2, 6, 2, 4, 3, 3, 3, 6, 3, 2, 7, 1, 1, 1, 4, 1, 1, 4, 5, 2, 3, 1, 2, 1]


In [190]:
print(f"Standard deviation on minimum nights is {np.std(integer_minimum_nights_list):.2f}")

Standard deviation on minimum nights is 10.69


In [191]:
print(f"Standard deviation on minimum nights is {np.std(ds2['minimum_nights']):.2f}")

Standard deviation on minimum nights is 10.69
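
One detail worth knowing: `np.std` defaults to `ddof=0`, which is the population standard deviation (divide by n); passing `ddof=1` gives the sample version (divide by n - 1). The sketch below illustrates the difference with the standard library's `statistics` module on made-up values, so the numbers are not related to the Airbnb sample.

```python
import statistics

nights = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values with mean 5

population_sd = statistics.pstdev(nights)  # divides by n, like np.std(x)
sample_sd = statistics.stdev(nights)       # divides by n - 1, like np.std(x, ddof=1)

print(f"Population SD: {population_sd:.3f}")
print(f"Sample SD:     {sample_sd:.3f}")
```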


In [186]:
integer_maximum_nights= map(int, ds['maximum_nights'])
integer_maximum_nights_list = list(integer_maximum_nights)
print(integer_maximum_nights_list)

[1125, 24, 1125, 1125, 1125, 30, 1125, 1125, 1125, 1125, 1125, 1125, 7, 1125, 365, 30, 1125, 60, 1125, 99, 21, 1125, 7, 1125, 1125, 1125, 1125, 1125, 1125, 365, 365, 45, 365, 1125, 1125, 1125, 1125, 1125, 1125, 1125, 1125, 90, 1125, 7, 1125, 1125, 1125, 1125, 1125, 90, 1125, 20, 1125, 1125, 1125, 1125, 15, 1125, 1125, 90, 14, 30, 1125, 45, 30, 1125, 1125, 30, 180, 9, 1125, 1125, 1125, 1125, 1125, 365, 7, 1125, 1125, 365, 1124, 10, 93, 20, 14, 28, 1125, 28, 14, 180, 30, 1125, 1125, 12, 30, 30, 1125, 1125, 31, 1125]


In [187]:
print(f"Median maximum nights is {np.median(integer_maximum_nights_list)}")

Median maximum nights is 1125.0


But...

In [189]:
print(f"Median price per night is {np.median(ds2['price'])}")

TypeError: cannot perform reduce with flexible type

Why is this happening? Write some code below to check:
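
One way to investigate, sketched here on a made-up stand-in for the column, is to look at which element types are present and which values `float()` refuses to parse:

```python
sample = ['$226.00', '$41.00', '$35.00']   # made-up stand-in for a column

# Collect the distinct element types present in the list
kinds = {type(x).__name__ for x in sample}
print(kinds)

# Find values that float() cannot parse directly
unparseable = []
for x in sample:
    try:
        float(x)
    except ValueError:
        unparseable.append(x)
print(unparseable)
```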

In [228]:
ds2['price'][0:]

['$226.00',
 '$41.00',
 '$35.00',
 '$55.00',
 '$100.00',
 '$68.00',
 '$150.00',
 '$110.00',
 '$130.00',
 '$45.00',
 '$166.00',
 '$29.00',
 '$20.00',
 '$150.00',
 '$102.00',
 '$29.00',
 '$300.00',
 '$203.00',
 '$56.00',
 '$30.00',
 '$50.00',
 '$230.00',
 '$23.00',
 '$413.00',
 '$149.00',
 '$32.00',
 '$181.00',
 '$199.00',
 '$26.00',
 '$150.00',
 '$15.00',
 '$25.00',
 '$89.00',
 '$30.00',
 '$260.00',
 '$76.00',
 '$96.00',
 '$148.00',
 '$337.00',
 '$95.00',
 '$297.00',
 '$170.00',
 '$55.00',
 '$59.00',
 '$430.00',
 '$110.00',
 '$504.00',
 '$175.00',
 '$26.00',
 '$85.00',
 '$156.71',
 '$26.00',
 '$37.00',
 '$55.00',
 '$102.00',
 '$798.00',
 '$60.00',
 '$120.00',
 '$120.00',
 '$50.00',
 '$40.00',
 '$98.00',
 '$185.00',
 '$25.00',
 '$45.00',
 '$250.00',
 '$58.00',
 '$45.00',
 '$19.00',
 '$49.00',
 '$221.00',
 '$104.00',
 '$65.00',
 '$95.00',
 '$70.00',
 '$30.00',
 '$38.00',
 '$35.00',
 '$165.00',
 '$46.71',
 '$515.00',
 '$55.00',
 '$47.00',
 '$85.00',
 '$150.00',
 '$65.00',
 '$170.00',
 '$16

#### Task 1.9: Putting It All Together

Here's a clue for how to solve the 'price' data problem; you will need to combine it with something we've seen earlier that allows you to iterate over a list and apply the solution to every `x` in the 'price' column. If you are nearing the end of the 2-hour practical, then you may skip this task for now and work on converting the functions to a package (next task below).

In [240]:
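price = []

# Strip the '$' so that each price can be parsed as a float.
# str(x) keeps this cell safe to re-run if a value has already
# been converted (otherwise float has no .replace method).
for x in ds2['price']:
    price.append(float(str(x).replace('$','')))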

In [230]:
print(price)

[226.0, 41.0, 35.0, 55.0, 100.0, 68.0, 150.0, 110.0, 130.0, 45.0, 166.0, 29.0, 20.0, 150.0, 102.0, 29.0, 300.0, 203.0, 56.0, 30.0, 50.0, 230.0, 23.0, 413.0, 149.0, 32.0, 181.0, 199.0, 26.0, 150.0, 15.0, 25.0, 89.0, 30.0, 260.0, 76.0, 96.0, 148.0, 337.0, 95.0, 297.0, 170.0, 55.0, 59.0, 430.0, 110.0, 504.0, 175.0, 26.0, 85.0, 156.71, 26.0, 37.0, 55.0, 102.0, 798.0, 60.0, 120.0, 120.0, 50.0, 40.0, 98.0, 185.0, 25.0, 45.0, 250.0, 58.0, 45.0, 19.0, 49.0, 221.0, 104.0, 65.0, 95.0, 70.0, 30.0, 38.0, 35.0, 165.0, 46.71, 515.0, 55.0, 47.0, 85.0, 150.0, 65.0, 170.0, 16.0, 45.0, 25.0, 99.0, 25.0, 50.0, 17.0, 95.0, 50.0, 105.0, 142.0, 263.0, 38.0]


In [231]:
print(f"Median price per night is {np.median(price)}")

Median price per night is 80.5


In [196]:
 float(ds2['price'][0].replace('$','')) 

226.0
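
The same cleaning step can be wrapped in a small function and applied with a list comprehension. This is a sketch only: `clean_price` is a hypothetical helper name, the values are made up, and the comma handling is an assumption about how larger prices might be formatted (none appear in the sample shown above).

```python
def clean_price(value):
    """Hypothetical helper: turn '$1,234.50' (or a number) into a float."""
    return float(str(value).replace('$', '').replace(',', ''))

raw_prices = ['$226.00', '$1,125.50', 41.0]   # made-up values
prices = [clean_price(p) for p in raw_prices]
print(prices)
```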

## Task 2. Creating a Package from Functions

Using or adapting as necessary the approach that we saw in the Live Coding session (Task 2 from Part 1), create a package called `dtools` by exporting the functions you've created above (only the final version of each, so don't export the one from Task 1.3 for instance) into a file called `__init__.py` that sits in the `dtools` directory.

In [70]:
!mkdir -p dtools

In [76]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=True \
    --to python --output=dtools/__init__.py \
    Practical-04-Objects.ipynb

[NbConvertApp] Converting notebook Practical-04-Objects.ipynb to python
[NbConvertApp] Writing 33339 bytes to dtools/__init__.py


Once you have tidied up the content of `dtools/__init__.py` you should be able to run the code below. You can actually edit the `__init__.py` file directly in Jupyter as a text file. You can compare this to the file I've created on GitHub.
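
If it helps to see what a directory-plus-`__init__.py` package amounts to, here is a minimal sketch that builds one in a temporary directory and imports it. The package name `demo_tools` and the function `to_float` are made up for illustration; they are not part of `dtools`.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package called 'demo_tools' (a made-up name) in a temp dir
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, 'demo_tools')
os.makedirs(pkg_dir)

source = '''
def to_float(value):
    """Convert a price-like string such as '$41.00' to a float."""
    return float(str(value).replace('$', ''))
'''
with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
    f.write(source)

# Make the parent directory importable, then import the package by name
sys.path.insert(0, root)
importlib.invalidate_caches()
demo_tools = importlib.import_module('demo_tools')

result = demo_tools.to_float('$41.00')
print(result)
```

Functions defined in `__init__.py` are reached through the package name (`demo_tools.to_float`), and their docstrings are what `help()` displays.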

In [232]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [235]:
%reload_ext autoreload
%autoreload 2

In [236]:
import dtools
help(dtools.to_dol)

SyntaxError: f-string: unmatched '[' (__init__.py, line 803)
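
An import fails as soon as Python hits a syntax error anywhere in `__init__.py`, so it can be useful to check a file for syntax errors before importing it. The sketch below uses the built-in `compile()` on a deliberately broken, made-up source string; it does not reproduce the f-string problem reported above, whose cause is not shown here.

```python
# A deliberately broken module body: the parenthesis is never closed
broken_source = "values = (1, 2\nprint(values)\n"

try:
    compile(broken_source, "example_module.py", "exec")
    problem = None
except SyntaxError as err:
    # err.lineno points at where the parser gave up
    problem = (err.filename, err.lineno)

print(problem)
```

To check a real file, read its contents with `open(...).read()` and pass them to `compile()` together with the file name.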

In [87]:
import os  # needed for os.path.join below

url = 'https://raw.githubusercontent.com/jreades/fsds/master/data/2019-sample-crime.csv'
out = os.path.join('data','crime-sample.csv')
out

'data/crime-sample.csv'

In [237]:
dlol = dtools.get_url(url, out)
dlol = dtools.to_lol(dlol)
ddol = dtools.to_dol(dlol)

print(len(ddol.keys()))
print(len(ddol['ID']))

NameError: name 'dtools' is not defined

In [222]:
print(ddol.keys())

NameError: name 'ddol' is not defined