# Assignment 4: EDA and Bootstrapping

## Objective

Statistics plays a vital role in data science for two reasons. First, it can be used to gain a deep understanding of data. This process is known as **Exploratory Data Analysis (EDA)**. Second, it can be used to infer the relationship between a sample and the population. This process is known as **inference**. In this assignment, you will learn about EDA and statistical inference through the analysis of a very interesting dataset - [property tax report data](http://data.vancouver.ca/datacatalogue/propertyTax.htm). Specifically, you will learn the following:

1. Be able to perform EDA on a single column (i.e., univariate analysis) 
2. Be able to perform EDA on multiple columns (i.e., multivariate analysis)
3. Be able to extract insights from visualizations
4. Be able to ask critical questions about data
5. Be able to estimate a population parameter based on a sample
6. Be able to use the bootstrap to quantify the uncertainty of an estimated value

In this assignment, you can use [pandas](https://pandas.pydata.org/) or PySpark to manipulate data, and use [matplotlib](https://matplotlib.org/) or [seaborn](https://seaborn.pydata.org) to make plots. 

You can download the datasets of each assignment from  http://tiny.cc/cmpt733-datasets.

## Part 1. EDA

Imagine you are a data scientist working at a real-estate company. This week, your job is to analyze Vancouver's housing prices. You first download a dataset from [property_tax_report_2018.zip](http://tiny.cc/cmpt733-datasets/property_tax_report_2018.zip). The dataset contains information on properties from BC Assessment (BCA) and City sources in 2018.  You can find the schema information of the dataset on this [webpage](http://data.vancouver.ca/datacatalogue/propertyTaxAttributes.htm). But this is not enough. You still know little about the data. That's why you need to do EDA in order to get a better and deeper understanding of the data.

We first load the data as a DataFrame. To make this analysis more interesting, I added two new columns to the data: `CURRENT_PRICE` represents the property price in 2018; `PREVIOUS_PRICE` represents the property price in 2017. 

In [2]:
import pandas as pd

df = pd.read_csv("property_tax_report_2018.csv")

# Column-wise addition gives the same result as a row-wise apply, but much faster
df['CURRENT_PRICE'] = df['CURRENT_LAND_VALUE'] + df['CURRENT_IMPROVEMENT_VALUE']

df['PREVIOUS_PRICE'] = df['PREVIOUS_LAND_VALUE'] + df['PREVIOUS_IMPROVEMENT_VALUE']

print('read')

read


Now let's start the EDA process. 

**Hint.** For some of the following questions, I provided an example plot (see [link](A4-plots.html)). But note that you do not have to use the same plot design. In fact, I didn't do a good job of following the *Principles of Visualization Design* presented in Lecture 4.  You should think about how to correct the bad designs in my plots.

### Question 1. Look at some example rows
Print the first five rows of the data:

In [2]:
# --- Write your code below ---
df.head(5)

Unnamed: 0,PID,LEGAL_TYPE,FOLIO,LAND_COORDINATE,ZONE_NAME,ZONE_CATEGORY,LOT,BLOCK,PLAN,DISTRICT_LOT,...,CURRENT_IMPROVEMENT_VALUE,TAX_ASSESSMENT_YEAR,PREVIOUS_LAND_VALUE,PREVIOUS_IMPROVEMENT_VALUE,YEAR_BUILT,BIG_IMPROVEMENT_YEAR,TAX_LEVY,NEIGHBOURHOOD_CODE,CURRENT_PRICE,PREVIOUS_PRICE
0,025-734-601,STRATA,750040000000.0,75004024,C-2,Commercial,25,,BCS498,2027,...,242000,2018,472000.0,238000.0,2003.0,2003.0,,3,834000,710000.0
1,029-700-868,STRATA,638183000000.0,63818250,CD-1 (464),Comprehensive Development,132,,EPS2983,200A,...,327000,2018,603000.0,329000.0,,,,13,1042000,932000.0
2,029-814-227,STRATA,170826000000.0,17082596,CD-1 (535),Comprehensive Development,25,,EPS3173,311,...,273000,2018,416000.0,273000.0,,,,12,780000,689000.0
3,029-918-731,STRATA,640194000000.0,64019406,IC-3,Light Industrial,40,26.0,EPS2425,200A,...,170000,2018,168000.0,170000.0,,,,13,397000,338000.0
4,017-393-400,STRATA,601115000000.0,60111496,CD-1 (233),Comprehensive Development,7,,LMS75,185,...,380000,2018,531000.0,385000.0,1991.0,1991.0,,27,1181000,916000.0


### Question 2. Get summary statistics

From the above output, you will know that the data has 28 columns. Please use the describe() function to get the summary statistics of each column.

In [3]:
# --- Write your code below ---
df.describe()

Unnamed: 0,FOLIO,LAND_COORDINATE,TO_CIVIC_NUMBER,CURRENT_LAND_VALUE,CURRENT_IMPROVEMENT_VALUE,TAX_ASSESSMENT_YEAR,PREVIOUS_LAND_VALUE,PREVIOUS_IMPROVEMENT_VALUE,YEAR_BUILT,BIG_IMPROVEMENT_YEAR,TAX_LEVY,NEIGHBOURHOOD_CODE,CURRENT_PRICE,PREVIOUS_PRICE
count,205346.0,205346.0,204731.0,205346.0,205346.0,205346.0,203042.0,203042.0,194899.0,194905.0,0.0,205346.0,205346.0,203042.0
mean,498432200000.0,49843220.0,2355.494566,1862369.0,400692.3,2018.0,1695359.0,387500.9,1979.969641,1987.35409,,16.524159,2263062.0,2082860.0
std,247937200000.0,24793720.0,1947.760697,10742590.0,4148662.0,0.0,9646130.0,4236152.0,29.419729,19.839132,,9.052394,12587260.0,11318190.0
min,19632060000.0,1963206.0,1.0,0.0,0.0,2018.0,0.0,0.0,1800.0,200.0,,1.0,1.0,1.0
25%,210792000000.0,21079190.0,948.0,468000.0,95300.0,2018.0,384000.0,94700.0,1965.0,1975.0,,9.0,653000.0,567000.0
50%,612236000000.0,61223630.0,1777.0,1057000.0,183000.0,2018.0,944000.0,181000.0,1990.0,1992.0,,16.0,1278000.0,1201000.0
75%,688277000000.0,68827740.0,3290.0,1692000.0,295000.0,2018.0,1680000.0,288000.0,2002.0,2002.0,,25.0,1984000.0,1937000.0
max,845313000000.0,84531340.0,9295.0,3516727000.0,611798000.0,2018.0,3319471000.0,626232000.0,2015.0,2015.0,,30.0,3516727000.0,3319471000.0


Please look at the above output carefully, and make sure that you understand the meaning of each row (e.g., std, 25th percentile).
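As a quick check on what the rows mean, the following minimal sketch recomputes a few of them by hand. The toy `prices` column is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Toy column standing in for a numeric column of the real data (assumption).
prices = pd.Series([100, 200, 300, 400, 500], name="toy_price")

summary = prices.describe()

# Each row of describe() matches a standalone pandas call.
q25 = prices.quantile(0.25)   # the "25%" row: 25th percentile
median = prices.median()      # the "50%" row: median
std = prices.std()            # the "std" row: sample standard deviation (ddof=1)

print(summary)
```

Note that `describe()` skips missing values, which is why the `count` row can differ from column to column.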

### Question 3. Examine missing values

Now we are going to perform EDA on a single column (i.e., univariate analysis). We chose `YEAR_BUILT`, which represents in which year a property was built.  We first check whether the column has any missing value. 

In [4]:
# --- Write your code below ---
# Print the percentage of the rows whose YEAR_BUILT is missing.
import pandas as pd

total = df['YEAR_BUILT'].shape[0]
# print(total)

missing_count = df['YEAR_BUILT'].isnull().sum()
# print(missing_count)

percentage_missing = (missing_count / total) * 100
print("Percentage of rows whose 'YEAR_BUILT' values are missing: ", percentage_missing)

Percentage of rows whose 'YEAR_BUILT' values are missing:  5.087510835370546


Missing values are very common in real-world datasets. In practice, you should always be aware of the impact of the missing values on your downstream analysis results.
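To see where missing values could affect later steps, one option is to compute the missing percentage for every column at once. The sketch below uses a made-up toy frame; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names mirror the dataset but the values are made up.
toy = pd.DataFrame({
    "YEAR_BUILT": [1990.0, np.nan, 2005.0, np.nan],
    "TAX_LEVY": [np.nan, np.nan, np.nan, np.nan],
    "LEGAL_TYPE": ["STRATA", "LAND", "STRATA", "LAND"],
})

# isnull() gives a boolean frame; the mean of booleans is the fraction of True.
missing_pct = toy.isnull().mean() * 100

# List the columns with the most missing data first.
print(missing_pct.sort_values(ascending=False))
```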

### Question 4.  Plot a line chart

We now start investigating the values in the `YEAR_BUILT` column.  Suppose we want to know: "How many properties were built in each year (from 1900 to 2018)?" Please plot a line chart to answer the question.
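The counting step behind such a chart can be sketched as follows. The `year_built` values are made up, and the range is shortened to 1900-1905 to keep the example small; filling empty years with zero is one possible design choice, not a requirement of the assignment.

```python
import pandas as pd

# Toy YEAR_BUILT values (made up); None becomes NaN and marks a missing year.
year_built = pd.Series([1900, 1902, 1902, 1905, 1905, 1905, None, 1899])

# Keep the requested range; comparisons with NaN are False, so NaN rows drop out.
start, end = 1900, 1905
in_range = year_built[(year_built >= start) & (year_built <= end)].astype(int)

# Count properties per year, then add the years with no properties as zeros
# so that the line does not silently skip them.
per_year = in_range.value_counts().reindex(range(start, end + 1), fill_value=0)

print(per_year)
```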

In [10]:
# --- Write your code below ---
import plotly.graph_objs as go
import numpy as np
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

df2 = df[(df['YEAR_BUILT']>=1900) & (df['YEAR_BUILT']<=2018)].YEAR_BUILT.unique()
df1 = df[(df['YEAR_BUILT']>=1900) & (df['YEAR_BUILT']<=2018)].groupby(['YEAR_BUILT']).size()

df2 = np.sort(df2)
# print(len(df2))
# print(len(df1))

# Plotly in offline mode
init_notebook_mode(connected=True)

# Create the plot
trace0 = go.Scatter(
        x = df2,
        y = df1.values,
        text = 'Properties',
        name = 'No. of prop',
        mode = 'lines+markers',
        marker=dict(
            color='black',
        ),
        line = dict(
            color = ('turquoise'),
            width = 3)
    )

# Layout for the plot
layout = dict(title = 'Number of Properties Built per Year(ranging from 1900 to 2018)',
              xaxis = dict(title = 'Year'),
              yaxis = dict(title = 'Number of Properties'),
              showlegend=True,
              legend=dict(bgcolor='lightgray',
                        bordercolor='gray',
                        borderwidth=2
                    )
            )

# Add data to the plot
data = [trace0]

# Combine data and layout together into a single figure
fig = dict(data=data, layout=layout)

# IPython notebook- Plot
iplot(fig, filename='line-plot')

Please write down the **two** most interesting findings that you draw from the plot. For example, you can say: <font color='blue'>"Vancouver has about 6300 properties built in 1996 alone, which is more than any other year"</font>. For each finding, please write <font color="red">no more than 2 sentences</font>.

**Findings**
1. Every year between 1988 and 2014 (inclusive) saw a large number of properties built (approximately 3000 or more), except for 2001, when the number dropped to 1533 (the lowest in that range).
2. In the period 1900-1987, 1910 and 1912 were the only years in which more than 3000 properties were built (3070 and 3854 respectively).

### Question 5. Plot a bar chart

Next, we want to find out in which years between 1900 and 2018 the most properties were built. Plot a bar chart to show the top 20 years. 
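One way to pick the top years is `Series.nlargest` on the per-year counts. This is a minimal sketch on made-up counts, using the top 3 instead of the top 20.

```python
import pandas as pd

# Toy per-year counts (made up), indexed by year.
per_year = pd.Series({1910: 30, 1912: 38, 1994: 62, 1995: 61, 2001: 15, 2009: 47})

# nlargest keeps the k largest counts, already sorted in descending order.
top3 = per_year.nlargest(3)

# Use the years as categorical labels so the bars follow the ranking,
# not the calendar order.
labels = [str(y) for y in top3.index]
heights = top3.tolist()
print(list(zip(labels, heights)))
```

Because the result is still a Series, each count stays paired with its year and no second lookup is needed.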

In [11]:
# --- Write your code below ---
import plotly.graph_objs as go
import numpy as np
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Prepare data for plotting: the 20 years with the most properties built.
# Sorting the Series keeps each count paired with its year, so years that
# happen to share the same count are not duplicated.
top20 = df1.sort_values(ascending=False).head(20)
year = [int(y) for y in top20.index]
df_top20 = top20.values

# Differentiate bars based on year built- use colours to differentiate
year_less, year_mid, year_greater = [], [], []
serial_num_less, serial_num_mid, serial_num_above = [], [], []

for count, i in enumerate(year):
    if i <= 1950:
        year_less.append(df_top20[count])
        serial_num_less.append(count)
    elif i <= 2000:
        year_mid.append(df_top20[count])
        serial_num_mid.append(count)
    else:
        year_greater.append(df_top20[count])
        serial_num_above.append(count)

# Bar positions 0..19, used as tick values on the x-axis
serial_num_year = list(range(len(year)))


# Creating the plot- Bar chart
init_notebook_mode(connected=True)

trace0 = go.Bar(
        y = year_less,
        x = serial_num_less,
        name = 'Years before 1950(incl)',
        text = 'Properties built',
        marker=dict(
            color='midnightblue',
        ),
        opacity=0.5
    )

trace1 = go.Bar(
        y = year_mid,
        x = serial_num_mid,
        name = 'Years between 1950 and 2000(incl)',
        text = 'Properties built',
        marker=dict(
            color='orange',
        ),
        opacity=0.65
    )

trace2 = go.Bar(
        y = year_greater,
        x = serial_num_above,
        name = 'Years after 2000',
        text = 'Properties built',
        marker=dict(
            color='salmon',
        ),
        opacity=0.65
    )

# Layout for the plot
layout = dict(title = 'Number of Properties Built in Top 20 Years(years that have seen maximum properties being built)',
              xaxis=go.layout.XAxis(title='Year',
                        showline=True,
                        mirror=True,
                        tick0 =1900,
                        tickmode='array',
                        ticktext=year,
                        tickvals=serial_num_year,
                        ticks='outside',
                        linecolor='#636363',
                        linewidth=6
                    ),
              yaxis = dict(title = 'No. of Properties',
                        zeroline=True,
                        showline=True,
                        mirror=True,
                        linecolor='#636363',
                        linewidth=6
                    ),
              showlegend=True,
              legend=dict(bgcolor='lightgray',
                        bordercolor='gray',
                        borderwidth=2
                    )
              )

# Add our data to the plot
data = [trace0, trace1, trace2]

# Combine data and layout into a single figure
fig = dict(data=data, layout=layout)

# IPython notebook- plot
iplot(fig, filename='bar-plot')

Please write down the **two** most interesting findings that you draw from the plot. 

**Findings**
1. The plot shows that only one year before 1950, the year 1912, had more than 2000 properties built.
2. The period between 1950 and 2000 has the most years (10 different years) with over 2000 properties built. Also, the top 5 years fall in the range 1994-2009, which is clustered around the turn of the 21st century.

### Question 6. Plot a histogram

What's the distribution of the number of properties built between 1900 and 2018? Please plot a histogram to answer this question.
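This question needs two aggregation steps: first count the properties per year, then build a histogram over those yearly counts. The second step can be sketched with `numpy.histogram`; the counts and bin edges below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy per-year counts (made up): one value per year.
per_year = pd.Series({2000: 120, 2001: 80, 2002: 950, 2003: 400, 2004: 60, 2005: 990})

# Histogram over the counts: each bin answers
# "in how many years did the number of built properties fall in this range?"
counts, edges = np.histogram(per_year.values, bins=[0, 250, 500, 750, 1000])

# Convert to percentages of years, as in a percent-normalised histogram.
percent = counts / counts.sum() * 100
for lo, hi, p in zip(edges[:-1], edges[1:], percent):
    print(f"{lo}-{hi}: {p:.1f}% of years")
```

The bar heights are therefore shares of years, not shares of properties.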

In [12]:
# --- Write your code below ---
import plotly.plotly as py
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import numpy as np

df4 = df[(df['YEAR_BUILT']>=1900) & (df['YEAR_BUILT']<=2018)].groupby(['YEAR_BUILT']).size()
# print(df4)

# Compute the bin width (not the number of bins): 1/25 of the range of yearly counts
minimum = np.amin(df4.values)
maximum = np.amax(df4.values)
bins = int((maximum-minimum)/25)

# Create the plot data
data = [go.Histogram(x=df4,
                histnorm='percent',
                name='Frequency- Percentage',
                text='percent fall in this range',
                xbins=dict(size=bins
                ),
                marker=dict(
                    color='tomato',
                    line=dict(
                        color='rgb(8,48,107)',
                        width=2
                    )
                ),
                opacity=0.65)]

# Create Layout for the plot
layout = go.Layout(
    title='Number of Properties Built VS the Year Built',
    xaxis=dict(
        title='No. of Properties Built (interval range is for every '+str(bins)+' increase in properties built)',
        zerolinecolor='#969696',
        zerolinewidth=2,
        zeroline=True,
        linecolor='#636363',
        linewidth=4,
        showline=True,
    ),
    yaxis=dict(
        title='Frequency (in percent)',
        showgrid=True,
        zeroline=True,
        showline=True,
        zerolinecolor='#969696',
        zerolinewidth=2,
        linecolor='#636363',
        linewidth=4,
    ),
    showlegend=True,
    legend=dict(bgcolor='lightgray',
            bordercolor='gray',
            borderwidth=2,
        )
    )

# Show the plot
init_notebook_mode(connected=True)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='histogram')

Please write down the **two** most interesting findings that you draw from the plot. 

**Findings**
1. The interval 753-1003 has the highest frequency: in 15.517% of the years between 1900 and 2018, the number of properties built fell in that range.
2. The 0-1000 range is the most frequent overall: more years between 1900 and 2018 have a yearly count in 0-1000 than in any other range. Also, between 2000 and 3000 the observed frequency is fairly constant across the sub-intervals (roughly 1.74-2.5%).

### Question 7. Make a scatter plot

Suppose we are interested in the years in which more than 2000 properties were built. Make a scatter plot to examine whether there is a relationship between the number of properties built and the year.
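Alongside the scatter plot, a correlation coefficient gives a rough numeric summary of a linear relationship. The sketch below filters made-up yearly counts and computes the Pearson correlation with `numpy.corrcoef`; it is an optional complement to the plot, not a replacement for it.

```python
import numpy as np
import pandas as pd

# Toy per-year counts (made up), indexed by year.
per_year = pd.Series({1910: 3000, 1950: 1000, 1990: 3500, 2000: 4000, 2010: 4500})

# Keep only the years with more than 2000 properties built.
above = per_year[per_year > 2000]

# Pearson correlation between the year and the number of properties built.
years = above.index.to_numpy(dtype=float)
built = above.to_numpy(dtype=float)
r = np.corrcoef(years, built)[0, 1]

print(f"Pearson r between year and count: {r:.3f}")
```

A value near 0 does not rule out a non-linear relationship, which is one reason to look at the scatter plot as well.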

In [13]:
# --- Write your code below ---
# The plot may take a while to render because of the large number of data points.
# If it does not appear, run this cell on its own.

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

# Prepare data for plotting
# Keep only the years in which more than 2000 properties were built
df_above = df1[df1 > 2000]
year_above = df_above.index.values
prop_above = df_above.values
# print(year_above)

# Create plot data
data = [go.Scatter(
    text = 'Year',
    x = prop_above,
    y = year_above,
    name = 'Years for which properties built >2000',
    mode = 'markers',
    marker = dict(
        size = 10,
        color = 'maroon',
        line = dict(
            width = 1.5,
            color = 'pink'
        )
    ),
    opacity = 0.7
)]

# Create/Design layout for the plot
layout = dict(title = 'Plot of Years in which more than 2000 properties were built',
              yaxis = dict(title='Year Built',
                        tick0=1900,
                        showgrid=False,
                        linecolor='#636363',
                        linewidth=2
                    ),
              xaxis = dict(title='No. of properties',
                        showgrid=False,
                        showline=True,
                        linecolor='#636363',
                        linewidth=2
                    ),
              showlegend=True,
              legend=dict(bgcolor='lightgray',
                        bordercolor='gray',
                        borderwidth=2
                    )
             )

# Plot the data using plotly offline mode (iplot from plotly.offline, not py.iplot)
init_notebook_mode(connected=True)
fig = dict(data=data, layout=layout)
iplot(fig, filename='scatter')

Please write down the **two** most interesting findings that you draw from the plot. 

**Findings**
1. Only two years before 1950 (1910 and 1912) had more than 2000 properties built, and both fall in the range of 3000-4000 properties. Also, only 5 years before the 1980s had more than 2000 properties built.
2. The consecutive years 1994 and 1995 have the highest numbers of properties built in the entire span 1900-2018.

### Question 8. PDF and CDF

Can you believe that you have already drawn 8 interesting findings by exploring a single column! This is the power of EDA combined with critical thinking. Now we are moving to multivariate analysis.

Suppose you want to compare the housing price between this year and last year, i.e., CURRENT_PRICE vs. PREVIOUS_PRICE. 
You can plot their distributions, and make the comparison. There are two ways to define a distribution: [Probability Density Function](https://en.wikipedia.org/wiki/Probability_density_function) (PDF) and [Cumulative Distribution Function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) (CDF). 
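To make the two definitions concrete, here is a minimal sketch of an empirical PDF and CDF built from a histogram with `numpy`. The prices and the bin width are made up; the real columns would take their place.

```python
import numpy as np

# Toy prices in dollars (made up).
prices = np.array([200_000, 450_000, 500_000, 750_000,
                   900_000, 1_200_000, 2_400_000, 4_800_000])

# 10 bins of width 500,000 covering 0 to 5 million.
bins = np.linspace(0, 5_000_000, 11)
counts, edges = np.histogram(prices, bins=bins)

# Empirical PDF as a normalised histogram: fraction of properties per bin.
pdf = counts / counts.sum()

# Empirical CDF: running total of the PDF, so it never decreases and ends at 1.
cdf = np.cumsum(pdf)

for right_edge, p, c in zip(edges[1:], pdf, cdf):
    print(f"<= {right_edge:>9,.0f}: pdf={p:.3f} cdf={c:.3f}")
```

The PDF shows where prices concentrate, while the CDF makes it easy to read off what share of properties lies below a given price.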

In the following, please make two plots and put them side-by-side.  
* In the first plot, use histograms to plot the probability distributions of CURRENT_PRICE and PREVIOUS_PRICE.
* In the second plot, use histograms to plot the cumulative distributions of CURRENT_PRICE and PREVIOUS_PRICE.

There are a few properties which are far more expensive than the others. For both plots, please exclude those properties by setting `xlim` = (0, 5 million).

In [14]:
# --- Write your code below ---
# The plot may take a while to render because of the large number of data points.
# If it does not appear, run this cell on its own (after running the first cell, which initializes df).
from plotly import tools
import plotly.plotly as py
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# Credentials for the online plotly service; the API key is redacted here
tools.set_credentials_file(username='savitaavenkat', api_key='<YOUR_API_KEY>')

# Prepare data for plotting
df6 = df[(df['CURRENT_PRICE']>0) & (df['PREVIOUS_PRICE']>0)].CURRENT_PRICE.dropna()
df7 = df[(df['CURRENT_PRICE']>0) & (df['PREVIOUS_PRICE']>0)].PREVIOUS_PRICE.dropna()

# Design the way data needs to be plotted on to the figure
trace0 = go.Histogram(x=df6, 
                    xbins=dict(start=np.min(df6), size=90, end=np.max(df6)),
                    name='current price distribution- PDF plot',
                    marker=dict(color='#F64E8B'),
                    opacity = 0.75
                    )

trace1 = go.Histogram(x=df7, 
                    xbins=dict(start=np.min(df7), size=90, end=np.max(df7)),
                    name='previous price distribution- PDF plot',
                    marker=dict(color='lightgreen'),
                    opacity = 0.75
                    )
trace2 = go.Histogram(x=df6, 
                    histnorm='probability',
                    cumulative=dict(enabled=True),
                    name='current price- CDF plot',
                    marker=dict(color='orange'),
                    opacity = 0.75
                    )

trace3 = go.Histogram(x=df7, 
                    histnorm='probability',
                    cumulative=dict(enabled=True),
                    name='previous price- CDF plot',
                    marker=dict(color='lightgreen'),
                    opacity = 0.75
                    )

# Create layout for the plot
layout = go.Layout(
        title="PDF, CDF plots of Current Price vs Previous Price",
        xaxis=dict(title='price of the property (in million)',
            range=[0, 5000000],
            ),
        yaxis=dict(title='Number of Properties (x10^1)',
            ),
        showlegend=True,
        barmode='overlay',
        legend=dict(bgcolor='lightgray',
            bordercolor='gray',
            borderwidth=2
        )
    )

# Create subplots
fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('PDF plot', 'CDF plot'))

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 2)

fig['layout']['xaxis1'].update(range=[0, 5000000], showgrid=False, type='linear')
fig['layout']['xaxis2'].update(title='price of the property (in million)', range=[0, 5000000], showgrid=False, type='linear')

fig['layout']['yaxis1'].update(showgrid=False, type='linear')
fig['layout']['yaxis2'].update(title='Probability', showgrid=False, type='linear')

# Finally plot the figure
init_notebook_mode(connected=True)
fig['layout'].update(layout, height=450, width=2000)

# py.plot(fig, filename='histogram-prob-dist-1')

py.iplot(fig, filename='histogram-prob-dist-1')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]




Woah there! Look at all those points! Due to browser limitations, the Plotly SVG drawing functions have a hard time graphing more than 500k data points for line charts, or 40k points for other types of charts. Here are some suggestions:
(1) Use the `plotly.graph_objs.Scattergl` trace object to generate a WebGl graph.
(2) Trying using the image API to return an image instead of a graph URL
(3) Use matplotlib
(4) See if you can create your visualization with fewer data points




High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~savitaavenkat/0 or inside your plot.ly account where it is named 'histogram-prob-dist-1'


Please write down the **two** most interesting findings that you draw from the plots. 

**Findings**
1. The PDF (probability density function) plot shows that for prices between approximately 0 and 0.6 million, the number of properties is higher for the previous price than for the current price. Above approximately 0.6 million, the number of properties is higher for the current price.
2. The CDF plot shows that beyond approximately 2 million, the cumulative probability for both the current and the previous price keeps increasing with the value of the property. As the price approaches 5 million, the cumulative probability becomes almost 1 for both.

### Question 9. Use EDA to answer an interesting question (1)

In the above plots, we found that the overall housing price has increased, but we do not know which type of property has increased more. 

Now we add another variable `LEGAL_TYPE` (e.g., STRATA, LAND) to the analysis, and consider three variables (`LEGAL_TYPE`, `CURRENT_PRICE`, `PREVIOUS_PRICE`) in total. 

In the following, please make two plots and put them side-by-side.
* In the first plot, please use histograms to plot the probability distributions of CURRENT_PRICE and PREVIOUS_PRICE for `LEGAL_TYPE` = "STRATA".
* In the second plot, please use histograms to plot the probability distributions of CURRENT_PRICE and PREVIOUS_PRICE for `LEGAL_TYPE` = "LAND".
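Besides comparing histograms, a grouped summary can hint at which legal type has increased more. The sketch below computes the median percentage change per `LEGAL_TYPE` on made-up prices; using the median is an assumption made here to limit the influence of very expensive properties.

```python
import pandas as pd

# Toy frame; the column names follow the dataset, the values are made up.
toy = pd.DataFrame({
    "LEGAL_TYPE": ["STRATA", "STRATA", "STRATA", "LAND", "LAND", "LAND"],
    "CURRENT_PRICE": [550, 660, 840, 1100, 2040, 3150],
    "PREVIOUS_PRICE": [500, 600, 800, 1000, 2000, 3000],
})

# Year-over-year change of each property, in percent.
toy["pct_change"] = (toy["CURRENT_PRICE"] / toy["PREVIOUS_PRICE"] - 1) * 100

# Median change per legal type.
median_change = toy.groupby("LEGAL_TYPE")["pct_change"].median()
print(median_change)
```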

In [15]:
# --- Write your code below ---
# The plot may take a while to render because of the large number of data points.
# If it does not appear, run this cell on its own (after running the first cell, which initializes df).

from plotly import tools
import plotly.plotly as py
import plotly.graph_objs as go
import numpy as np  # needed for np.min / np.max below
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# Credentials for the online plotly service; the API key is redacted here
tools.set_credentials_file(username='savitaavenkat', api_key='<YOUR_API_KEY>')

# Prepare data for plotting
df9 = df[(df['CURRENT_PRICE']>0) & (df['PREVIOUS_PRICE']>0) & (df['LEGAL_TYPE']=="STRATA")].CURRENT_PRICE.dropna()
df10 = df[(df['CURRENT_PRICE']>0) & (df['PREVIOUS_PRICE']>0) & (df['LEGAL_TYPE']=="STRATA")].PREVIOUS_PRICE.dropna()
df11 = df[(df['CURRENT_PRICE']>0) & (df['PREVIOUS_PRICE']>0) & (df['LEGAL_TYPE']=="LAND")].CURRENT_PRICE.dropna()
df12 = df[(df['CURRENT_PRICE']>0) & (df['PREVIOUS_PRICE']>0) & (df['LEGAL_TYPE']=="LAND")].PREVIOUS_PRICE.dropna()

# Design the way data needs to be plotted on to the figure
trace0 = go.Histogram(x=df9, 
                     xbins=dict(start=np.min(df9), size=90, end=np.max(df9)),
                     name='current price distribution when legal type is strata',
                     marker=dict(color='#F64E8B'),
                     opacity = 0.75
                    )

trace1 = go.Histogram(x=df10, 
                     xbins=dict(start=np.min(df10), size=90, end=np.max(df10)),
                     name='previous price distribution when legal type is strata',
                     marker=dict(color='lightgreen'),
                     opacity = 0.75
                    )
trace2 = go.Histogram(x=df11, 
                     histnorm='probability',
                     xbins=dict(start=np.min(df11), size=500, end=np.max(df11)),
                     name='current price distribution when legal type is land',
                     marker=dict(color='orange'),
                     opacity = 0.75
                    )

trace3 = go.Histogram(x=df12, 
                     histnorm='probability',
                     xbins=dict(start=np.min(df12), size=500, end=np.max(df12)),
                     name='previous price distribution when legal type is land',
                     marker=dict(color='gray'),
                     opacity = 0.75
                    )

# Create layout for the plot
layout = go.Layout(
        title="Probability Distribution of Current Price vs Previous Price based on legal type",
        xaxis=dict(title='price of the property (in million)',
            range=[0, 5000000],
            ),
        yaxis=dict(title='Number of Properties (x10^1)',
            ),
        showlegend=True,
        barmode='overlay',
        legend=dict(bgcolor='lightgray',
            bordercolor='gray',
            borderwidth=2
        )
    )

# Create subplots
fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Plot based on Strata', 'Plot based on Land'))

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 2)

fig['layout']['xaxis1'].update(range=[0, 5000000], showgrid=False, type='linear')
fig['layout']['xaxis2'].update(title='price of the property (in million)', range=[0, 5000000], showgrid=False, type='linear')

fig['layout']['yaxis1'].update(showgrid=False, type='linear')
fig['layout']['yaxis2'].update(title='Probability (x10^3)', showgrid=False, type='linear')

# Finally plot the figure
fig['layout'].update(layout, height=450, width=2000)

py.iplot(fig, filename='histogram-prob-dist')

Please write down the **two** most interesting findings that you draw from the plots. 

**Findings**
1. The STRATA plot shows that in the range of approximately 0-0.6 million, more strata properties fall at the previous price than at the current price. Above approximately 0.6 million the relationship flips: the current price has more strata properties than the previous price.
2. The LAND plot, which has probability on the y-axis, shows that for properties priced in the mid range of 1-2 million the probability is almost 1 for both the previous and the current price. In the 2-3 million range, the current price often has a higher probability than the previous price.

### Question 10. Use EDA to answer interesting questions (2)

Although the housing price of the Vancouver area as a whole is increasing, there might be some areas whose housing price is decreasing. To find them, we need to consider another column -- `PROPERTY_POSTAL_CODE`.

`PROPERTY_POSTAL_CODE` (e.g., "V5A 1S6") is a six-character string with a space separating the third and fourth characters. We use the first three characters to represent an *area*. 

We first filter out the areas which have fewer than 10 properties. For each of the remaining areas, we calculate the percentage of the properties whose price has decreased compared to the last year. For example, if an area "V5A" has 50 properties, and 30 of them have decreased, then the percentage is 60%.

Please write code to find the top-10 areas with the highest percentages. Create a bar chart to visualize them. 
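The steps above can be sketched on a small made-up frame as follows. The helper columns `area` and `decreased` and the threshold `min_properties` are names introduced for this illustration only; the assignment's threshold is 10, lowered here to 2 so that the toy data survives the filter.

```python
import pandas as pd

# Toy frame; the column names follow the dataset, the values are made up.
toy = pd.DataFrame({
    "PROPERTY_POSTAL_CODE": ["V5A 1S6", "V5A 2B2", "V5A 3C3",
                             "V6B 1A1", "V6B 2B2", "V7X 9Z9"],
    "CURRENT_PRICE": [90, 110, 80, 200, 210, 50],
    "PREVIOUS_PRICE": [100, 100, 100, 190, 205, 60],
})
min_properties = 2  # the assignment uses 10; 2 keeps the toy example small

# The first three characters of the postal code identify the area.
toy["area"] = toy["PROPERTY_POSTAL_CODE"].str[:3]
toy["decreased"] = toy["CURRENT_PRICE"] < toy["PREVIOUS_PRICE"]

# Per area: number of properties and share of properties whose price decreased.
stats = toy.groupby("area")["decreased"].agg(["count", "mean"])
stats = stats[stats["count"] >= min_properties]

# Convert to percent and keep the 10 areas with the highest percentages.
top = (stats["mean"] * 100).sort_values(ascending=False).head(10)
print(top)
```

Here a property with a missing previous price counts as "not decreased", because comparisons with NaN are False; whether that is the desired treatment is a judgment call.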

In [8]:
# --- Write your code below ---
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF

def compute_perdec(df):
    df8 = df.dropna(subset=['PROPERTY_POSTAL_CODE']).copy()
    # The part of the postal code before the space identifies the area
    df8['PROPERTY_POSTAL_CODE'] = df8['PROPERTY_POSTAL_CODE'].str.split(' ').str[0]
    # Keep only the areas with at least 10 properties
    df8 = df8.groupby('PROPERTY_POSTAL_CODE').filter(lambda x: len(x) >= 10)
    area_full = df8.groupby('PROPERTY_POSTAL_CODE').size()

    # Properties whose price decreased; both columns come from the filtered
    # frame df8 (the original compared df8 against the unfiltered df)
    decreased = df8[df8['CURRENT_PRICE'] < df8['PREVIOUS_PRICE']]
    area = decreased.groupby('PROPERTY_POSTAL_CODE').size()

    # Percentage of properties per area whose price decreased
    percentage_decreased = (area / area_full) * 100

    # Sorting the Series keeps each percentage paired with its area, so areas
    # that share the same percentage are not duplicated
    top = percentage_decreased.dropna().sort_values(ascending=False).head(10)
    sorted_top10 = top.values
    top10 = list(top.index)

    return sorted_top10, top10
    
def plot(sorted_top10, top10):
    # Create plot data
    trace0 = go.Bar(
        y=[sorted_top10[0]],
        x=[0.1],
        name=str(top10[0]),
        text='Percent decrease in property price',
        marker=dict(
            color='palevioletred',
        ),
        opacity=0.85,
        width=0.1
    )

    trace1 = go.Bar(
        y=[sorted_top10[1]],
        x=[0.3],
        name=str(top10[1]),
        text='Percent decrease in property price',
        marker=dict(
            color='rosybrown',
        ),
        opacity=0.85,
        width=0.1
    )

    trace2 = go.Bar(
        y=[sorted_top10[2]],
        x=[0.5],
        name=str(top10[2]),
        text='Percent decrease in property price',
        marker=dict(
            color='salmon',
        ),
        opacity=0.85,
        width=0.1
    )

    trace3 = go.Bar(
        y=[sorted_top10[3]],
        x=[0.7],
        name=str(top10[3]),
        text='Percent decrease in property price',
        marker=dict(
            color='pink',
        ),
        opacity=0.85,
        width=0.1
    )

    trace4 = go.Bar(
        y=[sorted_top10[4]],
        x=[0.9],
        name=str(top10[4]),
        text='Percent decrease in property price',
        marker=dict(
            color='lightskyblue',
        ),
        opacity=0.85,
        width=0.1
    )

    trace5 = go.Bar(
        y=[sorted_top10[5]],
        x=[1.1],
        name=str(top10[5]),
        text='Percent decrease in property price',
        marker=dict(
            color='lightgreen',
        ),
        opacity=0.85,
        width=0.1
    )

    trace6 = go.Bar(
        y=[sorted_top10[6]],
        x=[1.3],
        name=str(top10[6]),
        text='Percent decrease in property price',
        marker=dict(
            color='paleturquoise',
        ),
        opacity=1,
        width=0.1
    )

    trace7 = go.Bar(
        y=[sorted_top10[7]],
        x=[1.5],
        name=str(top10[7]),
        text='Percent decrease in property price',
        marker=dict(
            color='lightseagreen',
        ),
        opacity=0.85,
        width=0.1
    )

    trace8 = go.Bar(
        y=[sorted_top10[8]],
        x=[1.7],
        name=str(top10[8]),
        text='Percent decrease in property price',
        marker=dict(
            color='lavender',
        ),
        opacity=0.85,
        width=0.1
    )

    trace9 = go.Bar(
        y=[sorted_top10[9]],
        x=[1.9],
        name=str(top10[9]),
        text='Percent decrease in property price',
        marker=dict(
            color='purple',
        ),
        opacity=0.6,
        width=0.1
    )

    # Create the layout for the plot
    layout = dict(title = 'Top 10 areas which have the highest drop in property price',
              xaxis=go.layout.XAxis(title='Area Code',
                        zeroline=True,
                        showline=True,
                        mirror=True,
                        tick0=0,
                        tickmode='array',
                        ticktext=top10,
                        tickvals=[0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7,1.9],
                        ticks='outside',
                        zerolinecolor='#969696',
                        zerolinewidth=2,
                        linecolor='#636363',
                        linewidth=6,
                        range=[0, 2]
                    ),
              yaxis = dict(title = 'Percentage Decrease (%)',
                        showgrid=True,
                        zeroline=True,
                        showline=True,
                        mirror=True,
                        gridcolor='#bdbdbd',
                        gridwidth=1,
                        zerolinecolor='#969696',
                        zerolinewidth=2,
                        linecolor='#636363',
                        linewidth=6,
                        range = [0, 100]
                    ),
              showlegend=True,
              legend=dict(bgcolor='lightgray',
                        bordercolor='gray',
                        borderwidth=2
                    )
              )
    
    # Combine the plot data with the layout designed for it, into a single figure
    data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8, trace9]
    fig = dict(data=data, layout=layout)

    # IPython notebook- show plot
    iplot(fig, filename='bar-plot')

def main():
    sorted_top10, top10 = compute_perdec(df)
    plot(sorted_top10, top10)

# call the methods
main()





Please write down the **two** most interesting findings that you draw from the plot. 

**Findings**
1. The top 4 area codes all begin with 'V6' (and their third letters all fall in the latter half of the alphabet), which may indicate that the broader V6 area is seeing property prices drop because of some common factor. 'V6S' has the highest percentage of properties with a decreased price, at 84.3%.
2. Apart from the top 2 areas, the remaining 8 fall into two groups of 4: V6N, V6R, V5W and V5P, where roughly 61% of properties decreased in price, and V5M, V5S, V5X and V6M, where roughly 40.2% did.

### Question 11. Come up with your own question.

*You need to complete the following three tasks.*

Firstly, please come up with an interesting question on your own (like Q9 and Q10). 

**A short description of the question:**
An earlier plot explored the percentage of properties whose price decreased, grouped by 'LEGAL_TYPE'. But is LEGAL_TYPE the only way to break down properties when comparing CURRENT_PRICE with PREVIOUS_PRICE? Certainly not!

Let us now explore the percentage of properties whose value increased, grouped by 'ZONE_CATEGORY'. Grouping the data shows that there are 9 distinct 'ZONE_CATEGORY' values. Let us compute this percentage for each of the 9 categories and plot the results (after dropping NaN values).

Why percentage increase here and not decrease, as for 'LEGAL_TYPE'? Good question. On closer exploration, the percentage of properties that increased in value turns out to be significantly high, so the percentage that decreased is correspondingly low. I feel the increase is therefore the better way to represent these data points.
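
One caveat is worth checking: the shares of increased, decreased and unchanged prices add up to 100% within each category, so the decrease is the exact complement of the increase only when no price stayed the same. A minimal sketch on made-up data (the rows and prices below are hypothetical; only the column names come from this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two zone categories plus one row without a category.
toy = pd.DataFrame({
    'ZONE_CATEGORY': ['Industrial'] * 4 + ['One Family Dwelling'] * 4 + [None],
    'PREVIOUS_PRICE': [100, 100, 100, 100, 200, 200, 200, 200, 300],
    'CURRENT_PRICE':  [120, 130, 110, 100, 180, 190, 210, 200, 310],
})

data = toy.dropna(subset=['ZONE_CATEGORY']).copy()
# Classify each property by the sign of its price change.
change = np.sign(data['CURRENT_PRICE'] - data['PREVIOUS_PRICE'])
data['CHANGE'] = change.map({1: 'increased', -1: 'decreased', 0: 'unchanged'})

# Row-normalised cross-tabulation: each row sums to 100 (%).
shares = pd.crosstab(data['ZONE_CATEGORY'], data['CHANGE'], normalize='index') * 100
print(shares.round(1))
```

If the 'unchanged' column is close to zero for every category, plotting either the increase or the decrease conveys the same information.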

Secondly, please write code so that the output of your code can answer the question.

In [9]:
# --- Write your code below ---
import plotly.graph_objs as go
from plotly.offline import iplot

def compute_perinc(df):
    df13 = df.copy()
    df13 = df13.dropna(subset=['ZONE_CATEGORY'])
    zone_full = df13.groupby(df13['ZONE_CATEGORY']).ZONE_CATEGORY.size()
    zone_increase = df13[(df13['CURRENT_PRICE']-df13['PREVIOUS_PRICE'])>0].groupby(df13['ZONE_CATEGORY']).ZONE_CATEGORY.size()
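    # Zones without any increase yield NaN after the division; treat them as 0%
    percentage_increased = (zone_increase/zone_full).fillna(0)*100

    # Take the labels from the result itself so that they stay aligned with the values
    index = percentage_increased.index

    # Return a plain array so that plot() can index it by position
    return index, percentage_increased.values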

def plot(index, percentage_increased):
    # Create plot data
    trace0 = go.Bar(
        y=[percentage_increased[0]],
        x=[0.1],
        name=str(index[0]),
        text='Percent increase in property price',
        marker=dict(
            color='palevioletred',
        ),
        opacity=0.85,
        width=0.1
    )

    trace1 = go.Bar(
        y=[percentage_increased[1]],
        x=[0.3],
        name=str(index[1]),
        text='Percent increase in property price',
        marker=dict(
            color='rosybrown',
        ),
        opacity=0.85,
        width=0.1
    )

    trace2 = go.Bar(
        y=[percentage_increased[2]],
        x=[0.5],
        name=str(index[2]),
        text='Percent increase in property price',
        marker=dict(
            color='salmon',
        ),
        opacity=0.85,
        width=0.1
    )

    trace3 = go.Bar(
        y=[percentage_increased[3]],
        x=[0.7],
        name=str(index[3]),
        text='Percent increase in property price',
        marker=dict(
            color='pink',
        ),
        opacity=0.85,
        width=0.1
    )

    trace4 = go.Bar(
        y=[percentage_increased[4]],
        x=[0.9],
        name=str(index[4]),
        text='Percent increase in property price',
        marker=dict(
            color='lightskyblue',
        ),
        opacity=0.85,
        width=0.1
    )

    trace5 = go.Bar(
        y=[percentage_increased[5]],
        x=[1.1],
        name=str(index[5]),
        text='Percent increase in property price',
        marker=dict(
            color='lightgreen',
        ),
        opacity=0.85,
        width=0.1
    )

    trace6 = go.Bar(
        y=[percentage_increased[6]],
        x=[1.3],
        name=str(index[6]),
        text='Percent increase in property price',
        marker=dict(
            color='paleturquoise',
        ),
        opacity=1,
        width=0.1
    )

    trace7 = go.Bar(
        y=[percentage_increased[7]],
        x=[1.5],
        name=str(index[7]),
        text='Percent increase in property price',
        marker=dict(
            color='lightseagreen',
        ),
        opacity=0.85,
        width=0.1
    )

    trace8 = go.Bar(
        y=[percentage_increased[8]],
        x=[1.7],
        name=str(index[8]),
        text='Percent increase in property price',
        marker=dict(
            color='lavender',
        ),
        opacity=0.85,
        width=0.1
    )

    # Create the layout for the plot
    layout = dict(title = 'Percentage increase in Property price based on zone category',
              xaxis=go.layout.XAxis(title='Zone Category',
                        zeroline=True,
                        showline=True,
                        mirror=True,
                        tick0=0,
                        tickmode='array',
                        ticktext=index,
                        tickvals=[0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7],
                        ticks='outside',
                        zerolinecolor='#969696',
                        zerolinewidth=2,
                        linecolor='#636363',
                        linewidth=6,
                        range=[0, 2],
                        tickangle=22,
                        tickfont=dict(
                            family='Old Standard TT, serif',
                            size=11.5,
                        ),
                    ),
              yaxis = dict(title = 'Percentage Increase (%)',
                        zeroline=True,
                        showline=True,
                        mirror=True,
                        gridcolor='#bdbdbd',
                        gridwidth=1,
                        zerolinecolor='#969696',
                        zerolinewidth=2,
                        linecolor='#636363',
                        linewidth=6,
                        range = [0, 100]
                    ),
              showlegend=True,
              legend=dict(bgcolor='lightgray',
                        bordercolor='gray',
                        borderwidth=2
                    )
              )

    # Combine the plot data with the layout designed for it, into a single figure
    data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8]
    fig = dict(data=data, layout=layout)

    # IPython notebook- show plot
    iplot(fig, filename='bar-plot')

def main():
    index, percentage_increased = compute_perinc(df)
    plot(index, percentage_increased)
    
# Call the methods
main()

Thirdly, please write the two most important findings.

**Findings**
1. 'One Family Dwelling' has the lowest percentage of properties that increased in value, at 42.87%, followed by 'Two Family Dwelling' at 70.74%.
2. The 'Historic Area' zone category has the highest percentage of properties that increased in value in the given dataset, closely followed by 'Industrial' at 97.73%.

## Part 2. Bootstrapping

In Part 1, we ran our analysis over the full dataset. In reality, however, you may not be that lucky: more often than not, you can only collect a sample of the data. Whenever you derive a conclusion from a sample (e.g., Vancouver's housing prices have increased by 10% since last year), you should ALWAYS ask yourself: <font color="blue">"CAN I TRUST IT?"</font> In other words, you want to know whether the same conclusion would be reached if the same analysis were conducted on the full data. In Part 2, you will learn how to use bootstrapping to answer this question.
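
To see why this question matters, a small simulation can help. The sketch below assumes a synthetic, log-normally distributed "population" of prices (not the property tax data) and shows how much the median varies from one random sample to the next; the population size, sample size and distribution parameters are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical population of 100,000 right-skewed prices.
population = rng.lognormal(mean=13.4, sigma=0.5, size=100_000)
true_median = np.median(population)

# Draw 200 independent samples of 500 prices and record each sample median.
sample_medians = np.array([
    np.median(rng.choice(population, size=500, replace=False))
    for _ in range(200)
])

print(f"population median: {true_median:,.0f}")
print(f"sample medians range from {sample_medians.min():,.0f} "
      f"to {sample_medians.max():,.0f}")
```

In practice only one such sample is available, which is why a method that estimates this spread from the sample alone is needed.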

Please download the sample dataset [property_tax_report_2018_sample.zip](http://tiny.cc/cmpt733-datasets/property_tax_report_2018_sample.zip), and load it as a DataFrame. 

In [10]:
df_sample = pd.read_csv("property_tax_report_sample.csv")

# Total price = land value + improvement value (vectorized column arithmetic)
df_sample['CURRENT_PRICE'] = df_sample['CURRENT_LAND_VALUE'] + df_sample['CURRENT_IMPROVEMENT_VALUE']

df_sample['PREVIOUS_PRICE'] = df_sample['PREVIOUS_LAND_VALUE'] + df_sample['PREVIOUS_IMPROVEMENT_VALUE']

df_sample = df_sample[df_sample['LEGAL_TYPE'] == 'STRATA']

### Task 1. Analysis Result Without Bootstrapping

Please compute the median of PREVIOUS_PRICE and CURRENT_PRICE, respectively, and compare them in a bar chart.

In [11]:
# --- Write your code below ---
import plotly.graph_objs as go
from plotly.offline import iplot

def compute_median():
    # Compute median directly on whole sample data without bootstrapping
    median_cp = df_sample.CURRENT_PRICE.median(skipna=True)/1000000 #(to get in million)
    median_pp = df_sample.PREVIOUS_PRICE.median(skipna=True)/1000000 #(to get in million)
    return median_cp, median_pp

def plot(median_cp, median_pp):
    # Create plot data
    trace0 = go.Bar(
            y=[median_cp],
            x=[0.75],
            name='Current Year',
            text='median',
            marker=dict(
                color='coral',
            ),
            opacity=0.85,
            width=0.2
        )

    trace1 = go.Bar(
            y=[median_pp],
            x=[0.25],
            text='median',
            name='Previous Year',
            marker=dict(
                color='maroon',
            ),
            opacity=0.85,
            width=0.2
        )

    # Create the layout for the plot
    layout = dict(title = 'Median of Current Price and Previous Price from Sample Data- without bootstrapping',
                xaxis=go.layout.XAxis(title='Price Year',
                        zeroline=True,
                        showline=True,
                        mirror=True,
                        tick0=0,
                        tickmode='array',
                        ticktext=['Current Year Price', 'Previous Year Price'],
                        tickvals=[0.75,0.25],
                        ticks='outside',
                        zerolinecolor='#969696',
                        zerolinewidth=2,
                        linecolor='#636363',
                        linewidth=6,
                        range=[0, 1]
                    ),
                  yaxis = dict(title = 'Median value (in million)',
                        showgrid=True,
                        zeroline=True,
                        showline=True,
                        mirror=True,
                        gridcolor='#bdbdbd',
                        gridwidth=1,
                        zerolinecolor='#969696',
                        zerolinewidth=2,
                        linecolor='#636363',
                        linewidth=6,
                        range = [0, 1]
                    ),
                  showlegend=True,
                  legend=dict(bgcolor='lightgray',
                        bordercolor='gray',
                        borderwidth=2
                    )
                )

    # Combine the plot data with the layout designed for it, into a single figure
    data = [trace0, trace1]
    fig = dict(data=data, layout=layout)

    # IPython notebook- show plot
    iplot(fig, filename='bar-plot')

def without_bootstrapping():
    median_cp, median_pp = compute_median()
    plot(median_cp, median_pp)

# Compute the median statistics without bootstrapping
without_bootstrapping()

### Task 2. Analysis Result With Bootstrapping

From the above chart, we find that the median of PREVIOUS_PRICE is about 0.6 M, and the median of CURRENT_PRICE is about 0.7 M. Since the numbers were obtained from the sample, <font color="blue">"CAN WE TRUST THESE NUMBERS?"</font> 

In the following, please implement the bootstrap by yourself, and compute a 95%-confidence interval for each number. [This document](./MIT18_05S14_Reading24.pdf) gives a good tutorial about the bootstrap. You can find the description of the algorithm in Section 7.
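
As a rough illustration of the idea, here is a minimal sketch of a bootstrap confidence interval for a median, run on synthetic data. The function name `bootstrap_median_ci`, the log-normal sample and the number of resamples are assumptions made for this example. It reports two common ways of turning the bootstrap medians into an interval: the percentile interval, and the "basic" interval built from the differences between each bootstrap median and the sample median. Which of the two the linked reading prescribes should be checked against its Section 7.

```python
import numpy as np

def bootstrap_median_ci(values, num_resamples=5000, confidence=0.95, seed=0):
    """Return (percentile_ci, basic_ci) for the median of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    n = len(values)
    sample_median = np.median(values)

    # Resample n values with replacement, num_resamples times, in one call.
    resamples = rng.choice(values, size=(num_resamples, n), replace=True)
    boot_medians = np.median(resamples, axis=1)

    tail = (1.0 - confidence) / 2.0 * 100  # 2.5 for a 95% interval
    lo, hi = np.percentile(boot_medians, [tail, 100 - tail])

    percentile_ci = (lo, hi)
    # Basic interval: reflect the bootstrap quantiles around the sample median.
    basic_ci = (2 * sample_median - hi, 2 * sample_median - lo)
    return percentile_ci, basic_ci

# Hypothetical sample of 400 prices, in millions.
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=-0.4, sigma=0.4, size=400)

percentile_ci, basic_ci = bootstrap_median_ci(sample)
print("sample median: %.3f" % np.median(sample))
print("percentile CI: [%.3f, %.3f]" % percentile_ci)
print("basic CI:      [%.3f, %.3f]" % basic_ci)
```

Both intervals have the same width; they differ only in how they are placed around the sample median, which matters when the bootstrap distribution is skewed.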

In [12]:
# --- Write your code below ---
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot

def bootstrap_sample(sample_length_cp, sample_length_pp, df_cp, df_pp, num_samples=10000):
    median_cp = []
    median_pp = []

    for i in range(num_samples):
        sample_cp = np.random.choice(df_cp, sample_length_cp)
        sample_pp = np.random.choice(df_pp, sample_length_pp)
        median_pp.append((np.median(sample_pp))/1000000)
        median_cp.append((np.median(sample_cp))/1000000)

    return(median_cp, median_pp)

def Confidence_interval(median, label, alpha=0.95):
    # Percentile confidence interval taken from the bootstrapped medians (values are in millions)
    lower_pct = ((1.0-alpha)/2.0) * 100
    upper_pct = (alpha+((1.0-alpha)/2.0)) * 100
    below = np.percentile(median, lower_pct)
    above = np.percentile(median, upper_pct)
    print('%s: %.1f%% confidence interval [%.3f, %.3f] (in million)' % (label, alpha*100, below, above))
    return below, above
    
def plot(median_sampled_cp, median_sampled_pp):
    # Plot bar chart to show the bootstrapped median value
    trace0 = go.Bar(
            y=[median_sampled_cp],
            x=[0.75],
            name='Current Year',
            text='median',
            marker=dict(
                color='coral',
            ),
            opacity=0.85,
            width=0.2
        )

    trace1 = go.Bar(
            y=[median_sampled_pp],
            x=[0.25],
            text='median',
            name='Previous Year',
            marker=dict(
                color='maroon',
            ),
            opacity=0.85,
            width=0.2
        )

    # Edit the layout
    layout = dict(title = 'Median of Current Price and Previous Price from Sample Data- with bootstrapping',
                  xaxis=go.layout.XAxis(title='Price Year',
                            zeroline=True,
                            showline=True,
                            mirror=True,
                            tick0=0,
                            tickmode='array',
                            ticktext=['Current Year Price', 'Previous Year Price'],
                            tickvals=[0.75,0.25],
                            ticks='outside',
                            zerolinecolor='#969696',
                            zerolinewidth=2,
                            linecolor='#636363',
                            linewidth=6,
                            range=[0, 1]
                        ),
                  yaxis = dict(title = 'Median value (in million)',
                            showgrid=True,
                            zeroline=True,
                            showline=True,
                            mirror=True,
                            gridcolor='#bdbdbd',
                            gridwidth=1,
                            zerolinecolor='#969696',
                            zerolinewidth=2,
                            linecolor='#636363',
                            linewidth=6,
                            range = [0, 1]
                        ),
                  showlegend=True,
                  legend=dict(bgcolor='lightgray',
                        bordercolor='gray',
                        borderwidth=2
                    )
                )

    data = [trace0, trace1]

    fig = dict(data=data, layout=layout)

    # IPython notebook
    iplot(fig, filename='bar-plot')

def bootstrap():
    sample_length_cp = len(df_sample.CURRENT_PRICE.dropna())
    sample_length_pp = len(df_sample.PREVIOUS_PRICE.dropna())
    num_samples=10000
    # print(sample_length_pp)
    # print(sample_length_cp)
    
    median_cp, median_pp = bootstrap_sample(sample_length_cp, sample_length_pp, df_sample.CURRENT_PRICE.dropna(), df_sample.PREVIOUS_PRICE.dropna(), num_samples)
    median_sampled_cp = round(np.median(median_cp),3)
    median_sampled_pp = round(np.median(median_pp),3)
    print(median_sampled_cp, median_sampled_pp)
    
    # Label each interval with the series it belongs to
    Confidence_interval(median_cp, 'Current Price')
    Confidence_interval(median_pp, 'Previous Price')
    
    plot(median_sampled_cp, median_sampled_pp)


# Perform bootstrapping
bootstrap()
    

0.697 0.603
Current Price: 95.0% confidence interval [0.678, 0.723] (in million)
Previous Price: 95.0% confidence interval [0.584, 0.621] (in million)


## Submission

Complete the code in this [notebook](https://github.com/sfu-db/bigdata-cmpt733/blob/master/Assignments/A4/A4.ipynb), and submit it to the CourSys activity Assignment 4.