# Data in files

Data is most commonly stored in files.  One popular file format is CSV (comma-separated values).  CSV files store data in table form, with every data item stored as text and items separated by commas.  Data is organised with one record per row.

CSV is a compact plain-text format: it stores little beyond the data itself, which keeps storage requirements low and reduces the time taken to transfer large files from one computer to another.
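
To make the format concrete, here is a minimal sketch that parses a few lines of CSV text with Python's built-in `csv` module. The two data rows reuse values from the housing data set used later in this worksheet; the names `sample_csv`, `header` and `records` are illustrative only.

```python
import csv
import io

# A tiny piece of CSV text: one header row, then one record per row
sample_csv = "area,median_salary\nbarnet,19568.0\nbexley,18621.0\n"

# csv.reader splits each row at the commas; io.StringIO lets us treat the string like a file
rows = list(csv.reader(io.StringIO(sample_csv)))
header = rows[0]
records = rows[1:]

print(header)
for record in records:
  # every item is text at this point, even the numbers
  print(record, type(record[1]))
```

Because everything in a CSV file is text, a library such as pandas has to work out which columns hold numbers when it reads the file.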

In this worksheet you will be given the code to read a table of data from a CSV file and will be given the column names.  Each column will hold a set of data of the same type (number, text, date).

You will be able to make lists following the example, and then use Python to get information (such as length, max, min, sum, average) or to print the list, or particular parts of the list.
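
As a rough preview of those list operations, the sketch below uses the first five `median_salary` values from the data set used later in this worksheet. `sample_salaries` is a name made up for this example.

```python
# first five median salaries from the data set
sample_salaries = [33020.0, 21480.0, 19568.0, 18621.0, 18532.0]

count = len(sample_salaries)      # how many items are in the list
largest = max(sample_salaries)
smallest = min(sample_salaries)
total = sum(sample_salaries)
average = total / count

print(count, largest, smallest, total, average)
print(sample_salaries[0])     # the first item
print(sample_salaries[1:3])   # a slice: the second and third items
print(sample_salaries[-1])    # the last item
```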

---
[Video](https://vimeo.com/996732163/eb811ace14?share=copy)

---
### Read the file

This code will read the file and create a table from which you can create lists, following the example.  

It uses a library called **pandas**, which is built to work with large data files.  It is common to refer to pandas as `pd` to keep the amount of typing short.  

**Run the code to see what the data set looks like**.  As long as you have run this code cell, the data table will always be available lower down in the notebook and will always be called `dataset`.

In [6]:
import pandas as pd

def get_dataset():
  url = "https://raw.githubusercontent.com/futureCodersSE/working-with-data/main/Data%20sets/housing_in_london_yearly_variables.csv"
  return pd.read_csv(url)

dataset = get_dataset()
display(dataset)

Unnamed: 0,code,area,date,median_salary,life_satisfaction,mean_salary,recycling_pct,population_size,number_of_jobs,area_size,no_of_houses,borough_flag
0,E09000001,city of london,1999-12-01,33020.0,,48922,0,6581.0,,,,1
1,E09000002,barking and dagenham,1999-12-01,21480.0,,23620,3,162444.0,,,,1
2,E09000003,barnet,1999-12-01,19568.0,,23128,8,313469.0,,,,1
3,E09000004,bexley,1999-12-01,18621.0,,21386,18,217458.0,,,,1
4,E09000005,brent,1999-12-01,18532.0,,20911,6,260317.0,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...
1066,K03000001,great britain,2019-12-01,30446.0,,37603,,,,,,0
1067,K04000001,england and wales,2019-12-01,30500.0,,37865,,,,,,0
1068,N92000002,northern ireland,2019-12-01,27434.0,,32083,,,,,,0
1069,S92000003,scotland,2019-12-01,30000.0,,34916,,,,,,0


<class 'pandas.core.frame.DataFrame'>
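
If you want to experiment without downloading the file, `pd.read_csv` also accepts a file-like object. The sketch below builds a three-row table from a string and inspects it. The rows are copied from the output above; `sample_csv` and `sample` are illustrative names, not part of the worksheet.

```python
import io
import pandas as pd

sample_csv = """area,date,median_salary,population_size
city of london,1999-12-01,33020.0,6581.0
barking and dagenham,1999-12-01,21480.0,162444.0
barnet,1999-12-01,19568.0,313469.0
"""

# io.StringIO wraps the string so that read_csv can treat it like a file
sample = pd.read_csv(io.StringIO(sample_csv))

print(type(sample))          # a pandas DataFrame, like `dataset`
print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # the column headers
print(sample.dtypes)         # the type of data each column holds
```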


---
### Get the list of areas

To create a list from a column, use the code below.  This code gets the `area` column; to get the `median_salary` or `population_size` column, replace `area` with an exact copy of the column header.  


In [11]:
areas = list(dataset['area'])
for area in areas:
  print(area)

city of london
barking and dagenham
barnet
bexley
brent
bromley
camden
croydon
ealing
enfield
greenwich
hackney
hammersmith and fulham
haringey
harrow
havering
hillingdon
hounslow
islington
kensington and chelsea
kingston upon thames
lambeth
lewisham
merton
newham
redbridge
richmond upon thames
southwark
sutton
tower hamlets
waltham forest
wandsworth
westminster
north east
north west
yorkshire and the humber
east midlands
west midlands
east
london
south east
south west
inner london
outer london
england
united kingdom
great britain
england and wales
northern ireland
scotland
wales
city of london
barking and dagenham
barnet
bexley
brent
bromley
camden
croydon
ealing
enfield
greenwich
hackney
hammersmith and fulham
haringey
harrow
havering
hillingdon
hounslow
islington
kensington and chelsea
kingston upon thames
lambeth
lewisham
merton
newham
redbridge
richmond upon thames
southwark
sutton
tower hamlets
waltham forest
wandsworth
westminster
north east
north west
yorkshire and the humber
e
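
The same pattern works for any column. As a small sketch on a hand-made table (three rows copied from the data above; `sample` is an illustrative name), `list(...)` and the Series method `.tolist()` give lists holding the same values:

```python
import pandas as pd

sample = pd.DataFrame({
  "area": ["city of london", "barking and dagenham", "barnet"],
  "median_salary": [33020.0, 21480.0, 19568.0],
})

# selecting a column gives a pandas Series; list() turns it into a plain Python list
salaries = list(sample["median_salary"])
same_salaries = sample["median_salary"].tolist()

print(len(salaries))
print(salaries == same_salaries)
```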

---
### Exercise 1 - sort the areas list into alphabetical order

Write a function that will:  
*  Use the `.sort()` method to sort the areas into alphabetical order.  
*  Use a for loop to print the sorted list

In [13]:
def sort_areas():
  areas.sort()
  for area in areas:
    print(area)

sort_areas()

barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barking and dagenham
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
barnet
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
bexley
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
brent
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bromley
bro
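
Two details are worth noting here. `.sort()` reorders the existing list in place and returns `None`, whereas the built-in `sorted()` returns a new list and leaves the original untouched. Each area also appears many times in the output above, because the table holds one row per area for each date; wrapping the list in `set()` is one way to keep each name once. A minimal sketch with a few area names from the data (`sample_areas` is an illustrative name):

```python
sample_areas = ["brent", "barnet", "bexley", "barnet", "brent"]

# sorted() returns a new, sorted list; sample_areas keeps its original order
in_order = sorted(sample_areas)

# set() drops the duplicates, then sorted() puts the unique names in order
unique_in_order = sorted(set(sample_areas))

# .sort() changes the list itself and returns None
result = sample_areas.sort()

print(in_order)
print(unique_in_order)
print(result)
```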

---
### Exercise 2 - create another list

Create a new list called **median_salaries**.  Print the `median_salaries` list, one item per line.  


In [15]:
def print_median_salaries():
  # build the list from the median_salary column, then print one item per line
  median_salaries = list(dataset['median_salary'])
  for salary in median_salaries:
    print(salary)

print_median_salaries()

33020.0
21480.0
19568.0
18621.0
18532.0
16720.0
23677.0
19563.0
20580.0
19289.0
21236.0
23249.0
25000.0
18783.0
20596.0
17165.0
24002.0
20155.0
25113.0
20646.0
19302.0
23151.0
20580.0
18962.0
18862.0
19580.0
22321.0
22784.0
19582.0
26376.0
18547.0
21321.0
24447.0
16282.0
16977.0
16527.0
16392.0
17000.0
18000.0
22487.0
18737.0
16727.0
nan
nan
17939.0
17803.0
17866.0
17974.0
15798.0
16914.0
16457.0
34903.0
22618.0
21761.0
19363.0
22348.0
16401.0
25484.0
21339.0
22512.0
22467.0
22121.0
24083.0
25264.0
19772.0
22238.0
17418.0
25038.0
23230.0
26598.0
22952.0
20983.0
24264.0
22357.0
18906.0
19437.0
20256.0
22574.0
24857.0
22019.0
28445.0
19776.0
22840.0
25951.0
17430.0
17863.0
17503.0
17352.0
17812.0
19020.0
24204.0
19992.0
17847.0
nan
nan
19107.0
18848.0
18915.0
19000.0
16599.0
18029.0
17157.0
39104.0
22323.0
20916.0
20217.0
21878.0
15684.0
27386.0
20889.0
23862.0
24136.0
20761.0
24095.0
26319.0
20349.0
22960.0
17940.0
26051.0
22853.0
28323.0
24979.0
21427.0
24019.0
21990.0
21014.0
21054.0


---
### Exercise 3 - print statistics about the median salaries

Write a function that will:  

*  print the number of salaries in the list
*  create a variable called **largest**, assign it the largest salary in the list
*  print the value of `largest`
*  create a variable called **smallest**, assign it the smallest salary in the list
*  print the value of `smallest`
*  calculate and print the difference between `largest` and `smallest`
*  sort the `median_salaries` list into ascending order
*  calculate the `index` (or position) of the value in the middle of the list (as an integer)
*  print the item at that position in the list

**Expected output**
1071
61636.0
15684.0
45952.0
32681.0

In [23]:
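def create_salary_stats():
  print(len(dataset['median_salary']))
  largest = max(dataset['median_salary'])
  print(largest)
  smallest = min(dataset['median_salary'])
  print(smallest)
  diff = largest - smallest
  print(diff)
  # sort_values() returns a new Series; missing values (NaN) are placed at the end
  sorted_salaries = dataset['median_salary'].sort_values()
  print(sorted_salaries)
  index = len(sorted_salaries) // 2
  # NOTE: [] on a Series with integer labels looks up the row *label*, not the position,
  # so this prints the salary from row 535 of the original table rather than the
  # middle of the sorted values (.iloc[index] would look up by position)
  print(sorted_salaries[index])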


create_salary_stats()

1071
61636.0
15684.0
45952.0
107     15684.0
48      15798.0
168     16123.0
33      16282.0
36      16392.0
         ...   
989         NaN
994         NaN
1025        NaN
1062        NaN
1063        NaN
Name: median_salary, Length: 1071, dtype: float64
31235.0
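
The final value printed above (31235.0) differs from the expected 32681.0. One likely contributor is that `sort_values()` keeps each row's original label, and square brackets on a Series with integer labels look up by label. The sketch below uses five salaries from the table in a shuffled order (`salaries` is an illustrative name) to contrast label lookup, positional lookup with `.iloc`, and indexing a sorted plain list. It does not try to reproduce the expected output, since the real column also contains missing values.

```python
import pandas as pd

salaries = pd.Series([21480.0, 33020.0, 18532.0, 19568.0, 18621.0])

# the values are reordered, but each one keeps its original row label
sorted_salaries = salaries.sort_values()
index = len(sorted_salaries) // 2

by_label = sorted_salaries[index]          # the row whose label is 2
by_position = sorted_salaries.iloc[index]  # the third value in sorted order

# a plain Python list has no labels, so [index] is always positional
sorted_list = sorted(salaries.tolist())
from_list = sorted_list[index]

print(by_label, by_position, from_list)
```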


---
### Exercise 4 - create a population list

Create a new list called **population_sizes**.  Print the `population_sizes` list, one item per line.  

In [26]:
population_sizes = list(dataset["population_size"])
for size in population_sizes:
  print(size)

6581.0
162444.0
313469.0
217458.0
260317.0
294902.0
190003.0
332066.0
302252.0
272731.0
212168.0
199087.0
160634.0
218559.0
207909.0
225712.0
245053.0
214298.0
175717.0
147678.0
146003.0
266817.0
250310.0
185062.0
240517.0
238138.0
172782.0
247853.0
179375.0
193507.0
221057.0
264220.0
189233.0
2550314.0
6773115.0
4956325.0
4152443.0
5271959.0
5338722.0
7153912.0
7955124.0
4880958.0
2750716.0
4403196.0
49032872.0
58684427.0
57005421.0
51933471.0
1679006.0
5071950.0
2900599.0
7014.0
163893.0
315784.0
218717.0
264945.0
295317.0
196174.0
334241.0
304370.0
275068.0
214438.0
203381.0
164393.0
219845.0
209114.0
225141.0
245911.0
214731.0
177852.0
154661.0
147328.0
270028.0
252106.0
188196.0
245463.0
239868.0
172920.0
252726.0
180485.0
197133.0
221296.0
267695.0
196478.0
2543421.0
6774223.0
4958609.0
4168076.0
5269626.0
5374972.0
7236712.0
7990598.0
4917074.0
2804949.0
4431763.0
49233311.0
58886065.0
57203121.0
52140181.0
1682944.0
5062940.0
2906870.0
7359.0
165654.0
319481.0
218757.0
269620.0

---
### Exercise 5 - print some statistics about population sizes

Write a function that will:

*  print the number of items in the `population_sizes` list
*  create a variable called **largest_population**, assign it the largest population size in the list
*  print the value of `largest_population`
*  create a variable called **smallest_population**, assign it the smallest population size in the list
*  print the value of `smallest_population`
*  calculate and print the difference between largest and smallest population
*  create a variable called **total** to hold the sum of the population_sizes list
*  calculate and print the average population per area
*  print the total

**Expected output**
1071
66435550.0
6581.0
66428969.0
nan

**Note:**  The last output, `nan`, means 'not a number'.  You may have noticed when you printed the list that some values were `nan`.  These mark missing data, and any arithmetic that includes a `nan` also gives `nan`, so `sum` cannot produce a meaningful total.  You will learn how to deal with this later on in the course.
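
For the curious, here is a brief preview of this behaviour using only built-in Python. It is a sketch, not necessarily the approach taught later in the course; `sample_sizes` and `known_sizes` are illustrative names, and the two numbers are the first two population sizes from the data set.

```python
import math

nan = float("nan")
sample_sizes = [6581.0, 162444.0, nan]

# any sum that includes nan is itself nan
total = sum(sample_sizes)
print(total)

# nan is not equal to anything, not even itself, so use math.isnan() to detect it
print(nan == nan)
print(math.isnan(total))

# keep only the values that are real numbers, then the usual functions work
known_sizes = [size for size in sample_sizes if not math.isnan(size)]
print(sum(known_sizes), sum(known_sizes) / len(known_sizes))
```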

In [32]:
def create_population_stats():
  population_sizes = list(dataset["population_size"])
  print(len(population_sizes))
  largest_population = max(population_sizes)
  print(largest_population)
  smallest_population = min(population_sizes)
  print(smallest_population)
  diff = largest_population - smallest_population
  print(diff)
  # the list contains nan (missing values), so both the total and the average are nan
  total = sum(population_sizes)
  average = total / len(population_sizes)
  print(average)
  print(total)

create_population_stats()

1071
66435550.0
6581.0
66428969.0
nan
nan


---
### CHALLENGE (optional)

From the exercises above, you know the largest and smallest population_size and you know the largest and smallest median_salary.

Write a function that will:
*  get new copies of the three lists (`areas`, `median_salaries`, `population_sizes`), using the same code as before
*  use the `.index()` method to get the `index` of the largest `median_salary`
*  print the `area` that is at this `index` in the `areas` list
*  use the `.index()` method to get the `index` of the smallest `population_size`
*  print the `area` that is at this `index` in the `areas` list

Are the areas the same?  

**Question**:  why wouldn't it be appropriate, with this dataset, to see if the area with the largest population_size had the lowest median_salary?  If you are not sure, amend the code below so that it finds the largest population and the smallest median_salary.

In [34]:
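def find_areas():
  areas = list(dataset['area'])
  median_salaries = list(dataset['median_salary'])
  population_sizes = list(dataset["population_size"])
  # position of the largest median salary, then the area at that position
  largest_median_salary = max(median_salaries)
  index = median_salaries.index(largest_median_salary)
  print(areas[index])
  # position of the smallest population size, then the area at that position
  index2 = population_sizes.index(min(population_sizes))
  print(areas[index2])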

find_areas()

city of london
city of london
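
The lookup works because the three lists are parallel: the item at a given position in each list comes from the same row of the table. Note that `.index()` returns the position of the first match only. A small sketch with the first five rows of the data (all the `sample_` and result names are illustrative):

```python
sample_areas = ["city of london", "barking and dagenham", "barnet", "bexley", "brent"]
sample_salaries = [33020.0, 21480.0, 19568.0, 18621.0, 18532.0]
sample_sizes = [6581.0, 162444.0, 313469.0, 217458.0, 260317.0]

# find the position of a value in one list, then read the same position in another
highest_salary_area = sample_areas[sample_salaries.index(max(sample_salaries))]
smallest_population_area = sample_areas[sample_sizes.index(min(sample_sizes))]

# the reverse comparison only needs max and min swapped
largest_population_area = sample_areas[sample_sizes.index(max(sample_sizes))]
lowest_salary_area = sample_areas[sample_salaries.index(min(sample_salaries))]

print(highest_salary_area, smallest_population_area)
print(largest_population_area, lowest_salary_area)
```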


---
# Takeaways

*  we can use the pandas library to read data from an online CSV file and store it in a table
*  the table will have columns with headings and we can convert each column into a list
*  sometimes, data is incomplete and some statistics can't be calculated without cleaning up the data

---
# Your thoughts on what you have learnt  

Please add some comments in the box below to reflect on what you have learnt through completing this worksheet, and any problems you encountered while doing so.