# **Best Practices for Writing and Documenting Code**
## **Introduction**
This coding assignment gives you practice writing efficient code and documenting it. The exercises cover efficient data cleaning with Pandas and designing reusable data workflows. Document all the code you write in this assignment; documentation is part of your grade.


## **Setup**
The following code will import libraries that will be useful for this assignment.




In [1]:
!pip install pdpipe
import pandas as pd
import numpy as np
import zipfile
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import timeit
import pdpipe

%matplotlib inline


## **Shortcuts**


Even if you are familiar with Jupyter, you are strongly encouraged to become proficient with keyboard shortcuts (this will save you time in the future). To learn about keyboard shortcuts, go to **Help --> Keyboard Shortcuts** in the menu above. 

Here are a few that we like:
1. `Ctrl` + `Return` : *Evaluate the current cell*
1. `Shift` + `Return`: *Evaluate the current cell and move to the next*
1. `ESC` : *command mode* (may need to press before using any of the commands below)
1. `a` : *create a cell above*
1. `b` : *create a cell below*
1. `dd` : *delete a cell*
1. `z` : *undo the last cell operation*
1. `m` : *convert a cell to markdown*
1. `y` : *convert a cell to code*

## **Importing Datasets**
The following code will import datasets used in this assignment.

In [2]:
# next 2 lines are temporary
url = 'https://raw.githubusercontent.com/vikashraja24/cs189projectTfinal/main/biostats.csv'
biostats = pd.read_csv(url)
#biostats = pd.read_csv('biostats.csv')

# next 2 lines are temporary
url = 'https://raw.githubusercontent.com/vikashraja24/cs189projectTfinal/main/BL-Flickr-Images-Book.csv'
books = pd.read_csv(url)
#books = pd.read_csv('BL-Flickr-Images-Book.csv')

# next 2 lines are temporary
url = 'https://raw.githubusercontent.com/vikashraja24/cs189projectTfinal/main/USA_Housing.csv'
housing = pd.read_csv(url)
#housing = pd.read_csv('USA_Housing.csv')
housing

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386
...,...,...,...,...,...,...,...
4995,60567.944140,7.830362,6.137356,3.46,22837.361035,1.060194e+06,USNS Williams\nFPO AP 30153-7653
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352"
4997,63390.686886,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01..."
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316


##  **1. Efficient Pandas**
In this question, you will practice writing efficient Pandas code and learn to speed up code that filters and modifies data.






First, load and examine the biostats dataset.

In [3]:
biostats

Unnamed: 0,Name,"""Sex""","""Age""","""Height (in)""","""Weight (lbs)"""
0,Alex,"""M""",41,74,170
1,Bert,"""M""",42,68,166
2,Carl,"""M""",32,70,155
3,Dave,"""M""",39,72,167
4,Elly,"""F""",30,66,124
5,Fran,"""F""",33,66,115
6,Gwen,"""F""",26,64,121
7,Hank,"""M""",30,71,158
8,Ivan,"""M""",53,72,175
9,Jake,"""M""",32,69,143


**a)** Good code is readable, and datasets often come with poorly named columns. Clean the biostats dataset so that its column names are [Name, Sex, Age, Height, Weight]. Be sure to document any helper functions or processes in your code.
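
As an illustration of the kind of documented helper this part asks for, here is a minimal sketch on a toy frame. The frame `toy`, the helper name `clean_column_name`, and the exact header strings are assumptions made for this example; the real biostats headers may differ, so inspect `biostats.columns` before adapting it.

```python
import pandas as pd


def clean_column_name(name: str) -> str:
    """Strip whitespace and double quotes, then drop a trailing unit such as '(in)'."""
    cleaned = name.strip().strip('"').strip()
    # Keep only the text before the first parenthesis, e.g. 'Height (in)' -> 'Height'.
    return cleaned.split("(")[0].strip()


# Hypothetical toy frame whose headers imitate quoted column names.
toy = pd.DataFrame({"Name": ["Alex"], ' "Sex"': ["M"], ' "Height (in)"': [74]})

# rename accepts a function and applies it to every column label.
toy = toy.rename(columns=clean_column_name)
print(list(toy.columns))  # ['Name', 'Sex', 'Height']
```

Passing a function to `rename` keeps the cleaning rule in one documented place, so the same helper can be reused on other datasets with similar header problems.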

In [4]:
### start code ###


### end code ###


**b)** In this part, you will find the average height of people aged 30 or older in two ways. In the first cell, do not use any Pandas filtering or aggregation methods. In the second cell, use Pandas methods to get the same answer in one line.

In [5]:
### start code ###

### end code ###

In [6]:
### start code ###

### end code ###

**c)** Now time the execution of both methods above to verify the efficiency of the second block.

Hint: check out the timeit library
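
To show how `timeit` is typically used, here is a small sketch that times two ways of averaging a plain Python list. The data and the function names `mean_with_loop` and `mean_with_builtin` are made up for this example and are not part of the assignment.

```python
import timeit

# Hypothetical data: 10,000 integer "heights".
heights = list(range(60, 80)) * 500


def mean_with_loop(values):
    """Average the values with an explicit Python loop."""
    total = 0
    for value in values:
        total += value
    return total / len(values)


def mean_with_builtin(values):
    """Average the values with the built-in sum."""
    return sum(values) / len(values)


# timeit.timeit runs the callable `number` times and returns the total seconds.
loop_seconds = timeit.timeit(lambda: mean_with_loop(heights), number=200)
builtin_seconds = timeit.timeit(lambda: mean_with_builtin(heights), number=200)
print(f"loop: {loop_seconds:.4f}s, builtin: {builtin_seconds:.4f}s")
```

Timings vary between machines and runs. On a dataset as small as ten rows, fixed overhead can dominate, so it is worth repeating the measurement before drawing conclusions.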



In [7]:
### start code ###

### end code ###

In [8]:
### start code ###

### end code ###





##  **2. Pipelines**
In this question, you will work through exercises on data pipelines using the housing data. In the machine learning context, efficient data pipelines are a best practice for writing and documenting code because they make data workflows reusable. The real strength of a pipeline is that it automates repetitive tasks and speeds up data cleaning. As a machine learning engineer, you may be tasked with cleaning and filtering multiple datasets, and pipelines are an efficient way to do that.

**Hint**: read the notes about the pdpipe library
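
To make the idea of chained, reusable stages concrete, here is a minimal sketch that uses plain pandas `DataFrame.pipe` as a stand-in. It is not pdpipe, and the exercises below still expect pdpipe. The toy frame and the stage names `drop_columns`, `one_hot_encode`, and `clean` are hypothetical.

```python
import pandas as pd


def drop_columns(frame, columns):
    """Return a copy of frame without the given columns."""
    return frame.drop(columns=columns)


def one_hot_encode(frame, column):
    """Return frame with `column` replaced by one indicator column per category."""
    return pd.get_dummies(frame, columns=[column])


def clean(frame):
    """Apply the reusable stages in a fixed, documented order."""
    return (
        frame
        .pipe(drop_columns, ["notes"])
        .pipe(one_hot_encode, "size")
    )


# Hypothetical toy data; the original frame is left unchanged.
toy = pd.DataFrame({"size": ["small", "large"], "notes": ["a", "b"], "value": [1, 2]})
cleaned = clean(toy)
print(list(cleaned.columns))
```

Because each stage is a small named function, the same stages can be recombined for another dataset with a different cleaning specification.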

First, load and examine the housing dataset.

In [9]:
housing

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1.059034e+06,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.002900,6.730821,3.09,40173.072174,1.505891e+06,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.865890,8.512727,5.13,36882.159400,1.058988e+06,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1.260617e+06,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,6.309435e+05,USNS Raymond\nFPO AE 09386
...,...,...,...,...,...,...,...
4995,60567.944140,7.830362,6.137356,3.46,22837.361035,1.060194e+06,USNS Williams\nFPO AP 30153-7653
4996,78491.275435,6.999135,6.576763,4.02,25616.115489,1.482618e+06,"PSC 9258, Box 8489\nAPO AA 42991-3352"
4997,63390.686886,7.250591,4.805081,2.13,33266.145490,1.030730e+06,"4215 Tracy Garden Suite 076\nJoshualand, VA 01..."
4998,68001.331235,5.534388,7.130144,5.44,42625.620156,1.198657e+06,USS Wallace\nFPO AE 73316


**a)** To start off, create a column called 'Price Range' that sorts the prices into three ranges (low, medium, and high) based on the intervals [0, 250000], [250000, 750000], [750000, inf]. Use a helper function and follow proper style and naming conventions.
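
One possible way to bin numeric values into labeled ranges is `pd.cut`, sketched below on toy prices. The constant and helper names are hypothetical. Since the stated intervals share their endpoints, this sketch assumes a price exactly on a boundary belongs to the lower range; that is a choice made for this example, not something the assignment specifies.

```python
import numpy as np
import pandas as pd

PRICE_BINS = [0, 250_000, 750_000, np.inf]
PRICE_LABELS = ["low", "medium", "high"]


def categorize_prices(prices: pd.Series) -> pd.Series:
    """Map each price to 'low', 'medium', or 'high'."""
    # Bins are right-closed, so 250000 and 750000 fall in the lower range.
    # include_lowest=True keeps a price of exactly 0 in the first bin.
    return pd.cut(prices, bins=PRICE_BINS, labels=PRICE_LABELS, include_lowest=True)


# Hypothetical prices, including one that sits exactly on a boundary.
toy_prices = pd.Series([100_000, 250_000, 500_000, 1_200_000])
print(categorize_prices(toy_prices).tolist())  # ['low', 'low', 'medium', 'high']
```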

In [10]:
### start code ###

### end code ###

**b)** Now, let's assume the Address is useless for the proposed machine learning model. Use a one-stage pipeline to drop it.




In [11]:
### start code ###

### end code ###

**c)** Now, let's take advantage of the true power of pipelines. Suppose the data cleaning requirements are to drop the number of bedrooms column, drop the area population column, and then one-hot encode the price range column. Use a single pipeline to make these changes.

In [12]:
### start code ###

### end code ###

You should now be able to see how these small pipeline modules can be shared across datasets with similar cleaning specifications, and how this practice leads to efficient, readable, and reusable code.

##  **3. Efficient Practices to Visualize Code**
Now you will look at efficient practices for graphing, including documenting, labeling, and scaling graphs based on the results needed.




**a)** Let's examine the relationship between the average income and the price of a house in the housing data. Use a scatter plot to plot price vs income for the first 100 entries in the data.




In [13]:
### start code ###

### end code ###

**b)** Looking at the plot, you may find that the axis scales make it hard to see the details of the correlation. Rescale an axis appropriately so that the points lie closer to the unity line.

In [14]:
### start code ###

### end code ###

**c)** Now, following appropriate documentation and labeling practices, add axis labels and a title to the plot, taking into account the scaling you did in part b.

In [15]:
### start code ###

### end code ###

##  **4. Cleaning Data Efficiently with Documentation**
In this problem, you will walk through cleaning a messy dataset, applying the efficient practices for writing and documenting code that you have learned so far. For documentation purposes, keep a dictionary of the keys removed and modified, so that it tracks every process applied to the dataset during cleaning. Use appropriate comments on all functions.

First, load and examine the books dataset.

In [16]:
cleaning_updates = {'remove': [], 'modify': []}
books

Unnamed: 0,Identifier,Edition Statement,Place of Publication,Date of Publication,Publisher,Title,Author,Contributors,Corporate Author,Corporate Contributors,Former owner,Engraver,Issuance type,Flickr URL,Shelfmarks
0,206,,London,1879 [1878],S. Tinsley & Co.,Walter Forbes. [A novel.] By A. A,A. A.,"FORBES, Walter.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12641.b.30.
1,216,,London; Virtue & Yorston,1868,Virtue & Co.,All for Greed. [A novel. The dedication signed...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12626.cc.2.
2,218,,London,1869,"Bradbury, Evans & Co.",Love the Avenger. By the author of “All for Gr...,"A., A. A.","BLAZE DE BURY, Marie Pauline Rose - Baroness",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 12625.dd.1.
3,472,,London,1851,James Darling,"Welsh Sketches, chiefly ecclesiastical, to the...","A., E. S.","Appleyard, Ernest Silvanus.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 10369.bbb.15.
4,480,"A new edition, revised, etc.",London,1857,Wertheim & Macintosh,"[The World in which I live, and my place in it...","A., E. S.","BROOME, John Henry.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS 9007.d.28.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8282,4158088,,London,1838,,"The Parochial History of Cornwall, founded on,...","GIDDY, afterwards GILBERT, Davies.","BOASE, Henry Samuel.|HALS, William.|LYSONS, Da...",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 10...
8283,4158128,,Derby,"1831, 32",M. Mozley & Son,The History and Gazetteer of the County of Der...,"GLOVER, Stephen - of Derby","NOBLE, Thomas.",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 10...
8284,4159563,,London,[1806]-22,T. Cadell and W. Davies,Magna Britannia; being a concise topographical...,"LYSONS, Daniel - M.A., F.R.S., and LYSONS (Sam...","GREGSON, Matthew.|LYSONS, Samuel - F.R.S",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 19...
8285,4159587,,Newcastle upon Tyne,1834,Mackenzie & Dent,"An historical, topographical and descriptive v...","Mackenzie, E. (Eneas)","ROSS, M. - of Durham",,,,,monographic,http://www.flickr.com/photos/britishlibrary/ta...,British Library HMNTS|British Library HMNTS 10...


**a)** First, find and print all columns that contain any empty values. However, if Author has empty values, replace those with the empty string. Update cleaning_updates accordingly.
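
The pattern of cleaning while recording each change can be sketched on a toy frame as follows. The frame `toy`, the dictionary `log`, and the helper `columns_with_missing` are hypothetical stand-ins for `books` and `cleaning_updates`, and what you choose to store in the log (column names here) is an assumption of this example.

```python
import pandas as pd

# Hypothetical toy frame and log; the data is invented for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Author": ["X", None, "Z"],
    "Engraver": [None, None, None],
})
log = {"remove": [], "modify": []}


def columns_with_missing(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one missing value."""
    return frame.columns[frame.isna().any()].tolist()


print(columns_with_missing(toy))  # ['Author', 'Engraver']

# Fill missing authors with the empty string and record the change right away.
toy["Author"] = toy["Author"].fillna("")
log["modify"].append("Author")
```

Updating the log in the same step as the change keeps the record from drifting out of sync with the data.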


In [17]:
### start code ###

### end code ###

**b)** Now, drop all the columns that need to be removed because their data is incomplete.

In [18]:
### start code ###

### end code ###

**c)** Remove the unnecessary brackets in the title names efficiently. Remember to practice good documentation and use of helper methods for readable and reproducible code.
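
For reference, a vectorized string operation avoids looping over rows. The sketch below assumes that "removing brackets" means deleting the `[` and `]` characters while keeping the text between them; the helper name `strip_brackets` and the toy titles are hypothetical.

```python
import pandas as pd


def strip_brackets(titles: pd.Series) -> pd.Series:
    """Remove every '[' and ']' character while keeping the enclosed text."""
    # The character class [\[\]] matches either square bracket.
    return titles.str.replace(r"[\[\]]", "", regex=True)


# Hypothetical titles in the style of the books dataset.
toy_titles = pd.Series(["[The World in which I live]", "Walter Forbes. [A novel.]"])
print(strip_brackets(toy_titles).tolist())
# ['The World in which I live', 'Walter Forbes. A novel.']
```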

In [19]:
### start code ###

### end code ###

In [20]:
cleaning_updates

{'modify': [], 'remove': []}

Working through the cleaning process with reusable blocks of code, and documenting every change made to the initial dataset, is good documentation practice in the data cleaning context.