# Lab 1: ML Life Cycle: Business Understanding and Problem Formulation

In [None]:
import pandas as pd
import numpy as np

In this lab, you will practice the first step of the machine learning life cycle: formulating a machine learning problem. But first, you will get more practice working with some of the Python machine learning packages that you will use throughout the machine learning life cycle to develop your models.

## Part 1. Practice Working with ML Python Tools

In this part of the lab you will:

1. Work with NumPy arrays and NumPy functions
2. Create Pandas DataFrames from data
3. Use NumPy and Pandas to analyze the data
4. Visualize the data with Matplotlib

<b>Note</b>: In Jupyter Notebooks, you can output a variable in two different ways: 

1. By writing the name of the variable 
2. By using the Python `print()` function

The code cells below demonstrate this. Run each cell and inspect the results.

In [None]:
x = 5
x

In [None]:
x = 5
print(x)

A notebook cell displays only the value of its last expression. If you want to output multiple items from one cell, you must use `print()`. See the code cell below as an example.

In [None]:
y = 4
z = 3

print(y)
print(z)

## Practice Operating on NumPy Arrays

### a. Define a Python list

The code cell below defines a new list in Python.

In [None]:
python_list = [0,2,4,6,8,10,12,14]
python_list

### b. Define a Python range

The code cell below defines a Python range that contains the same values as those in the list above.

In [None]:
python_range = range(0, 15, 2)
python_range

The above returns an object of type `range`. The code cell below converts this object to a Python list using the Python `list()` function.
 

In [None]:
list(python_range)
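A `range` object produces its values on demand instead of storing them, which is why the earlier cell displays `range(0, 15, 2)` rather than the numbers themselves. The short sketch below (variable names other than `python_range` are illustrative) shows that a range still supports length, indexing, and membership checks before it is converted to a list.

```python
python_range = range(0, 15, 2)

# A range can be measured, indexed and searched without converting it first
n_values = len(python_range)
third_value = python_range[2]
has_ten = 10 in python_range

# Converting to a list materializes every value
as_list = list(python_range)

print(n_values, third_value, has_ten)
print(as_list)
```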

### c. Define a NumPy range

<b>Task:</b> In the code cell below, use NumPy's `np.arange()` method to create a NumPy range that has the same output as the Python range above. Save the output to the variable `numpy_range`.

In [2]:
numpy_range = np.arange(0, 15, 2)
numpy_range

array([ 0,  2,  4,  6,  8, 10, 12, 14])

The code above returns an object of type `ndarray` (i.e. an array). 

<b>Task:</b> In the code cell below, convert the NumPy array `numpy_range` to a Python list.


In [4]:
list(numpy_range)

[0, 2, 4, 6, 8, 10, 12, 14]

### d. List comprehension

Consider the task of replacing each value in a list with its square. The traditional way of performing a transformation on every element of a list is via a `for` loop. 


<b>Task:</b> In the code cell below, use a `for` loop to replace every value in the list `my_list` with its square (e.g. 2 will become 4). In your loop, use a range.



In [5]:
my_list = [1,2,3,4]

for i in range(len(my_list)):
    my_list[i] = my_list[i] ** 2
    
print(my_list)

[1, 4, 9, 16]


There is a different, more 'Pythonic' way to accomplish the same task.<br>

*List comprehension* is one of the most elegant functionalities native to Python. It offers a concise way of applying a particular transformation to every element in a list. <br>

By using list comprehension syntax, we can write a single, easily interpretable line of code that does the same transformation without using ranges or an iterating index variable `i`:

In [None]:
my_list = [1,2,3,4]

my_list = [x**2 for x in my_list]

print(my_list)
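A list comprehension can also filter elements with an optional `if` clause. The sketch below is an illustration only (the names `even_squares` and `squares_array` are not part of the lab); it also shows that the same squaring can be applied to a NumPy array without any loop.

```python
import numpy as np

my_list = [1, 2, 3, 4]

# Keep only the even values, then square each of them
even_squares = [x**2 for x in my_list if x % 2 == 0]

# NumPy applies the operation to every element of the array at once
squares_array = np.array(my_list) ** 2

print(even_squares)
print(squares_array)
```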

### e. Create a NumPy array

<b>Task:</b> In the code cell below, create a `numpy` array that contains the integer values 1 through 4. Save the result to variable `arr`.

In [8]:
arr = np.array([1, 2, 3, 4])
print(arr)

[1 2 3 4]


### e.  Obtain the dimensions of the NumPy array

The NumPy function `np.shape()` returns the dimensions of a `numpy` array. Because `numpy` arrays can be two dimensional (i.e. a matrix), `np.shape()` returns a tuple that contains the lengths of the array's dimensions. You can consider the result to be the number of rows and the number of columns in the NumPy array.

<b>Task:</b> In the code cell below, use `np.shape()` to find the 'shape' of `numpy` array `arr`.

In [9]:
print(np.shape(arr))

(4,)


Notice the trailing comma in the output. It does not mark an empty 'slot' for another number: `(4,)` is simply how Python writes a tuple that has a single element.
Since `arr` is a one-dimensional array, the tuple that `np.shape()` returns contains only one value, the length of the array.

<b>Task:</b> In the code cell below, obtain the length of `arr` by extracting the first element that is returned by `np.shape()`. Save the result to the variable `arr_length`.


In [10]:
arr_length = np.shape(arr)[0]

print(arr_length)

4
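To make the shape tuple more concrete, the sketch below compares a one-dimensional array with a small two-dimensional one. The array `grid` is a hypothetical example introduced here for illustration; `.shape` and `.ndim` are standard attributes of NumPy arrays.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
grid = np.array([[1, 2, 3], [4, 5, 6]])  # hypothetical 2 x 3 example

# One entry per dimension: a 1-D array yields a one-element tuple
shape_1d = arr.shape
shape_2d = grid.shape

print(shape_1d, arr.ndim)
print(shape_2d, grid.ndim)

# For a 1-D array, the first shape entry equals len()
print(shape_1d[0] == len(arr))
```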


### f. Create a uniform (same value in every position) array

We will now use ```np.ones()``` to create an array of a specified length that contains the value '1' in each position:

In [None]:
np.ones(55, dtype=int)

We can use this method to create an array of any identical value. Let's create an array of length 13, filled with the value '7' in every position:

In [None]:
7 * np.ones(13, dtype=int)
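As an alternative to multiplying an array of ones, NumPy also provides `np.full()`, which fills a new array with a given value directly. The sketch below (variable names are illustrative) checks that both approaches give the same result.

```python
import numpy as np

# Approach used above: scale an array of ones
sevens_scaled = 7 * np.ones(13, dtype=int)

# Alternative: ask NumPy for an array filled with 7
sevens_full = np.full(13, 7)

print(sevens_full)
print(np.array_equal(sevens_scaled, sevens_full))
```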

### g. Create a two-dimensional NumPy array

Let us explore the possibilities of the ```np.array()``` function further. NumPy arrays can be of more than one dimension. The code cell below creates a two-dimensional `numpy` array (a matrix).

In [13]:
matrix = np.array([[1,2,3], [4,5,6]])
matrix

array([[1, 2, 3],
       [4, 5, 6]])

<b>Task:</b> In the code cell below, use `np.shape()` to find the dimensions of `matrix`.

In [14]:
np.shape(matrix)

(2, 3)

### h. Create an identity matrix

`np.eye()` is a NumPy function that creates an identity matrix of a specified size. An identity matrix is a matrix in which all of the values of the main diagonal are one, and all other values in the matrix are zero.

In [None]:
np.eye(5)

Check your intuition: What do you think will be the output after running this cell? Run the cell to see if you are correct.

In [None]:
A = np.eye(3)
B = 4 * np.eye(3)
A+B

### i. A small challenge:  matrix transformation and random matrix generation

The `np.triu()` function obtains the upper right triangle of a two-dimensional NumPy array (matrix). Inspect the documentation by running the command ```np.triu?``` in the cell below. 

In [16]:
np.triu?

<b>Task:</b> Inspect the code in the cell below, then run the code and note the resulting matrix `M`.

In [17]:
M = np.round(np.random.rand(5,5),2)
print("M=\n", M)

M=
 [[0.64 0.32 0.13 0.97 0.46]
 [0.84 0.49 0.21 0.54 0.81]
 [0.53 0.1  0.83 0.16 0.16]
 [0.46 0.04 0.52 0.42 0.83]
 [0.93 0.07 0.6  0.43 0.59]]


<b>Task:</b> Use `np.triu()` to create a matrix called ```new_M``` which is identical to the matrix```M```, except that in the lower triangle (i.e., all the cells below the diagonal), all values will be zero.

In [20]:
new_M = np.triu(M)
print("new_M=\n", new_M)

new_M=
 [[0.64 0.32 0.13 0.97 0.46]
 [0.   0.49 0.21 0.54 0.81]
 [0.   0.   0.83 0.16 0.16]
 [0.   0.   0.   0.42 0.83]
 [0.   0.   0.   0.   0.59]]


<b>Task:</b> Using the code provided above for generating the matrix ```M```, try creating a matrix with 13 rows and 3 columns containing random numbers. Save the resulting matrix to the variable `random_M`.

In [21]:
random_M = np.round(np.random.rand(13, 3), 2)
print("random_M=\n", random_M)

random_M=
 [[0.07 0.85 0.72]
 [0.44 0.77 0.5 ]
 [0.96 0.67 0.67]
 [0.41 0.31 0.42]
 [0.78 0.13 0.57]
 [0.72 0.8  0.9 ]
 [0.14 0.81 0.81]
 [0.38 0.61 0.06]
 [0.29 0.79 0.97]
 [0.19 0.22 0.8 ]
 [0.11 0.2  0.24]
 [0.23 0.77 0.65]
 [0.48 0.86 0.78]]
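Because `np.random.rand()` draws new numbers on every run, your matrices will differ from the ones printed here. If you want repeatable results, one option is a seeded generator. The sketch below assumes NumPy's `np.random.default_rng()` interface; the seed value 42 is arbitrary and the variable names are illustrative.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = np.round(rng_a.random((13, 3)), 2)
sample_b = np.round(rng_b.random((13, 3)), 2)

print(sample_a.shape)
print(np.array_equal(sample_a, sample_b))
```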


### j. Indexing and slicing two-dimensional NumPy arrays

The code cell below extracts a single element of a two-dimensional NumPy array by specifying its row and column position. Just like Python lists, NumPy arrays use 0-based indexing.

In [None]:
random_M[3][2]

You can also use the following syntax to achieve the same result.

In [None]:
random_M[3,2]

You learned how to slice Pandas DataFrames. You can use the same techniques to slice a NumPy array. 


<b>Task:</b> In the code cell below, use slicing to obtain the rows with the index 3 through 5 in `random_M`.

In [22]:
random_M[3:6]

array([[0.41, 0.31, 0.42],
       [0.78, 0.13, 0.57],
       [0.72, 0.8 , 0.9 ]])

<b>Task:</b> In the code cell below, use slicing to obtain all of the rows in the second column (column has the index of 1) of `random_M`.

In [23]:
random_M[:, 1]

array([0.85, 0.77, 0.67, 0.31, 0.13, 0.8 , 0.81, 0.61, 0.79, 0.22, 0.2 ,
       0.77, 0.86])

<b>Task:</b> Use the code cell below to perform slicing on `random_M` to obtain a portion of the array of your choosing.

In [24]:
random_M[0:5, [0,2]]

array([[0.07, 0.72],
       [0.44, 0.5 ],
       [0.96, 0.67],
       [0.41, 0.42],
       [0.78, 0.57]])
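Since `random_M` holds random values, it can be easier to check your understanding of indexing and slicing on an array whose contents are predictable. The sketch below uses a hypothetical 4 x 3 array named `grid` built with `np.arange()`, so every result can be verified by eye.

```python
import numpy as np

# Rows are [0 1 2], [3 4 5], [6 7 8], [9 10 11]
grid = np.arange(12).reshape(4, 3)

element = grid[3, 2]        # row 3, column 2
middle_rows = grid[1:3]     # rows 1 and 2 (the stop index is excluded)
second_col = grid[:, 1]     # every row, column 1
block = grid[0:2, [0, 2]]   # rows 0-1, columns 0 and 2

print(element)
print(middle_rows)
print(second_col)
print(block)
```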

### k. Evaluating a Boolean condition

In real-life data tasks, you will often have to compute the Boolean ```(True/False)``` value of some statement for all entries in a given NumPy array. You will formulate a condition (think of it as a *test*) and run a computation that returns `True` or `False` for each value in the array, depending on whether that value passes or fails the test.

The condition may be something like "the value is greater than 0.5". You would like to know if this is true or false for every value in  the array. 

The code cells below demonstrate how to perform such a task on NumPy arrays.

First, we will create the array:

In [None]:
our_array = np.random.rand(1, 20)
print(our_array)

Next, we will apply a condition to the array:

In [None]:
is_greater = our_array > 0.5
print(is_greater)

Let's apply this technique to our matrix `random_M`. Let's inspect the matrix again as a refresher.

In [None]:
print(random_M)

<b>Task:</b> In the code cell below, determine whether the value of every element in the second column of `random_M` is greater than 0.5. Save the result to the variable `is_greater`.

In [26]:
is_greater = random_M[:, 1] > 0.5

print(is_greater)

[ True  True  True False False  True  True  True  True False False  True
  True]


We can use the function `np.any()` to determine if there is any element in a NumPy array that is True. Let us apply this to the array `is_greater` above. Using this function we can easily determine that indeed there are values greater than 0.5 in the second column of `random_M`.

In [None]:
np.any(is_greater)

Let's apply `np.any()` to another condition. 

<b>Task:</b> Use `np.any()` along with a conditional statement to determine if any value in the third row of `random_M` is less than .1.

In [27]:
any_less_than_point_one = np.any(random_M[2, :] < 0.1)

print(any_less_than_point_one)


False
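Boolean arrays are useful beyond `np.any()`. The sketch below, on a small hypothetical array `values`, shows three related operations: `np.all()` checks whether every element passes the test, `np.sum()` counts the elements that pass (each `True` counts as 1), and indexing with the Boolean array selects the passing values.

```python
import numpy as np

values = np.array([0.85, 0.77, 0.31, 0.13, 0.80])  # hypothetical data
mask = values > 0.5

any_greater = np.any(mask)     # is at least one value above 0.5?
all_greater = np.all(mask)     # are all values above 0.5?
count_greater = np.sum(mask)   # how many values are above 0.5?
selected = values[mask]        # the values that passed the test

print(any_greater, all_greater, count_greater)
print(selected)
```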


## Practice Working With Pandas DataFrames

### a. Creating a DataFrame: two (of the many) ways

The code cells below demonstrate how we can create Pandas DataFrames in two ways: 

1. From a *list of lists*
2. From a *dictionary*

First, the cell below creates a DataFrame from a list containing phone numbers and their country codes. The DataFrame is named `df`. Run the cell below to inspect the DataFrame `df` that was created.

In [31]:
import pandas as pd

my_list = [['+1', '(929)-000-0000'], ['+34', '(917)-000-0000'], ['+7', '(470)-000-0000']]

df = pd.DataFrame(my_list, columns = ['country_code', 'phone'])
df

  country_code           phone
0           +1  (929)-000-0000
1          +34  (917)-000-0000
2           +7  (470)-000-0000


Second, the cell below creates a DataFrame from a dictionary that contains the same information as the list above. The dictionary contains phone numbers and their country codes. Run the cell below to inspect the DataFrame `df_from_dict` that was created from the dictionary. Notice that both DataFrames `df` and `df_from_dict` contain the same values.

In [32]:
my_dict = {'country_code': ['+1', '+34', '+7'], 'phone':['(929)-000-0000', '(917)-000-0000', '(470)-000-0000']}

df_from_dict = pd.DataFrame(my_dict)
df_from_dict

  country_code           phone
0           +1  (929)-000-0000
1          +34  (917)-000-0000
2           +7  (470)-000-0000
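Rather than comparing the two DataFrames by eye, you can ask Pandas to compare them. The sketch below rebuilds both DataFrames under illustrative names (`df_rows`, `df_cols`) and uses the `DataFrame.equals()` method to check that they hold the same values.

```python
import pandas as pd

rows = [['+1', '(929)-000-0000'], ['+34', '(917)-000-0000'], ['+7', '(470)-000-0000']]
cols = {'country_code': ['+1', '+34', '+7'],
        'phone': ['(929)-000-0000', '(917)-000-0000', '(470)-000-0000']}

# A list of lists supplies the data row by row; a dictionary supplies it column by column
df_rows = pd.DataFrame(rows, columns=['country_code', 'phone'])
df_cols = pd.DataFrame(cols)

same = df_rows.equals(df_cols)
print(same)
```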


### b. Adding a column to a DataFrame object

We are going to continue working with the DataFrame `df` that was created above. The code cell below adds a new column of values to `df`. Run the cell and inspect the DataFrame to see the new column that was added.

In [33]:
df['grade']= ['A','B','A']
df

  country_code           phone grade
0           +1  (929)-000-0000     A
1          +34  (917)-000-0000     B
2           +7  (470)-000-0000     A


<b>Task:</b> In the cell below, create a new column in DataFrame `df` that contains the names of individuals.

* First, create a list containing three names of your choosing. 
* Next, create a new column in `df` called `names` by using the list you created.

In [34]:
names_list = ['Alice', 'Bob', 'Charlie']
df['names'] = names_list

print(df)

  country_code           phone grade    names
0           +1  (929)-000-0000     A    Alice
1          +34  (917)-000-0000     B      Bob
2           +7  (470)-000-0000     A  Charlie


### c. Sorting the DataFrame by values in a specific column

The `df.sort_values()` method sorts a DataFrame by the specified column. The code cell below uses `df.sort_values()` to sort DataFrame `df` by the values contained in column `grade`. The method returns a new, sorted DataFrame and leaves the original `df` unchanged, so we assign the result back to variable `df` to keep the sorted version.

In [35]:
df = df.sort_values(['grade'])
df

  country_code           phone grade    names
0           +1  (929)-000-0000     A    Alice
2           +7  (470)-000-0000     A  Charlie
1          +34  (917)-000-0000     B      Bob
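The sketch below, on a small hypothetical DataFrame named `students`, illustrates two points about `sort_values()`: the original DataFrame keeps its order unless you reassign the result, and the rows keep their original index labels after sorting. Sorting on a second column (`names`) is added here only to make the order of tied grades predictable.

```python
import pandas as pd

students = pd.DataFrame({'grade': ['B', 'A', 'A'],
                         'names': ['Bob', 'Alice', 'Charlie']})

# sort_values returns a new DataFrame; `students` itself is not modified
sorted_students = students.sort_values(['grade', 'names'])
descending = students.sort_values('grade', ascending=False)

original_order = list(students['names'])
sorted_order = list(sorted_students['names'])

print(original_order)
print(sorted_order)
print(list(sorted_students.index))  # index labels travel with their rows
```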


### d. Combining multiple DataFrames and renaming columns with `df.rename()`

In real life settings, you will often need to combine separate sets of related data. Two functions used for this purpose are `pd.concat()` and `pd.merge()`.


To illustrate, let's create a new DataFrame. The code cell below creates a new DataFrame `df2` that also contains phone numbers, their country codes and a grade. Run the cell and inspect the new DataFrame that was created.

In [36]:
my_dict2 = {'country': ['+32', '+81', '+11'], 'grade':['B', 'B+', 'A'], 'phone':['(874)-444-0000', '(313)-003-1000', '(990)-006-0660']}

df2 = pd.DataFrame(my_dict2)
df2

  country grade           phone
0     +32     B  (874)-444-0000
1     +81    B+  (313)-003-1000
2     +11     A  (990)-006-0660


The code cell below uses the Pandas ```pd.concat()``` function to append `df2` to `df`. The `pd.concat()` function will not change the values in the original DataFrames, so we will save the newly formed DataFrame to variable `df_concat`. 

In [37]:
df_concat = pd.concat([df,df2])
df_concat

  country_code           phone grade    names country
0           +1  (929)-000-0000     A    Alice     NaN
2           +7  (470)-000-0000     A  Charlie     NaN
1          +34  (917)-000-0000     B      Bob     NaN
0          NaN  (874)-444-0000     B      NaN     +32
1          NaN  (313)-003-1000    B+      NaN     +81
2          NaN  (990)-006-0660     A      NaN     +11


Notice that the new DataFrame `df_concat` contains two columns containing country codes. This is because the two original DataFrames used different names for that column (`country_code` in `df` and `country` in `df2`). 


We can easily fix this by changing the name of the column in DataFrame `df2` to be consistent with the name of the column in DataFrame `df`.

In [38]:
df2 = df2.rename(columns={'country':'country_code'})
df2

  country_code grade           phone
0          +32     B  (874)-444-0000
1          +81    B+  (313)-003-1000
2          +11     A  (990)-006-0660


<b>Task</b>: In the cell below, run the `pd.concat()` function again to concatenate DataFrames `df` and `df2` and save the resulting DataFrame to variable `df_concat2`. Run the cell and inspect the results.

In [39]:
df_concat2 = pd.concat([df, df2])
print(df_concat2)

  country_code           phone grade    names
0           +1  (929)-000-0000     A    Alice
2           +7  (470)-000-0000     A  Charlie
1          +34  (917)-000-0000     B      Bob
0          +32  (874)-444-0000     B      NaN
1          +81  (313)-003-1000    B+      NaN
2          +11  (990)-006-0660     A      NaN


One other problem is that the index has repeated values. This defeats the purpose of an index, and ought to be fixed. Let's try the concatenation again, this time adding the `reset_index()` method, which gives every row a new, unique index value and keeps the old labels in a column named `index`:

In [40]:
df_concat2 = pd.concat([df,df2]).reset_index()
df_concat2

   index country_code           phone grade    names
0      0           +1  (929)-000-0000     A    Alice
1      2           +7  (470)-000-0000     A  Charlie
2      1          +34  (917)-000-0000     B      Bob
3      0          +32  (874)-444-0000     B      NaN
4      1          +81  (313)-003-1000    B+      NaN
5      2          +11  (990)-006-0660     A      NaN


Now we have one column for `country_code`. Notice that we have missing values for the names of individuals, since names  were contained in `df` but not in `df2`. In a future unit, you will learn how to deal with missing values.
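The result of `reset_index()` above also contains a column named `index` that holds the old row labels. If you do not need them, two common options are `reset_index(drop=True)` and the `ignore_index=True` argument of `pd.concat()`. The sketch below uses two small hypothetical DataFrames (`left` and `right`) to compare the three variants.

```python
import pandas as pd

left = pd.DataFrame({'country_code': ['+1', '+34'], 'grade': ['A', 'B']})
right = pd.DataFrame({'country_code': ['+32'], 'grade': ['B+']})

# Keeps the old labels in an extra column called 'index'
with_old_index = pd.concat([left, right]).reset_index()

# Discards the old labels instead of storing them
dropped = pd.concat([left, right]).reset_index(drop=True)

# Builds a fresh 0..n-1 index during concatenation
fresh = pd.concat([left, right], ignore_index=True)

print(list(with_old_index.columns))
print(list(fresh.columns), list(fresh.index))
```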

What if our task were to merge ```df2``` with yet another dataset &mdash; one that contains additional unique columns? Let's look at DataFrame `df2` again:

In [41]:
df2

  country_code grade           phone
0          +32     B  (874)-444-0000
1          +81    B+  (313)-003-1000
2          +11     A  (990)-006-0660


The code cell below creates a new DataFrame `df3`.

In [42]:
my_dict3 = {'country_code': ['+32', '+44', '+11'], 'phone':['(874)-444-0000', '(575)-755-1000', '(990)-006-0660'], 'grade':['B', 'B+', 'A'], 'n_credits': [12, 3, 9]}

df3 = pd.DataFrame(my_dict3)
df3

  country_code           phone grade  n_credits
0          +32  (874)-444-0000     B         12
1          +44  (575)-755-1000    B+          3
2          +11  (990)-006-0660     A          9


The following code cell merges both DataFrames based on the values contained in the `phone` column. By default, `merge()` keeps only the rows whose `phone` value appears in both DataFrames and combines each matching pair into a single row; rows without a match are not included in the result. Other columns that share a name in both DataFrames receive the suffixes `_x` and `_y`. Run the code cell below and inspect the new DataFrame `df_merged`.

In [43]:
df_merged = df2.merge(df3, on = 'phone')
df_merged

  country_code_x grade_x           phone country_code_y grade_y  n_credits
0            +32       B  (874)-444-0000            +32       B         12
1            +11       A  (990)-006-0660            +11       A          9

## Practice Working With a Dataset

We are now well equipped to deal with a real dataset! Our dataset will contain information about New York City listings on the Airbnb platform.

### a. Load the dataset: `pd.read_csv()`

The code cell below loads a dataset from a CSV file and saves it to a Pandas DataFrame. 

First, we will import the `os` module. This module enables you to interact with the operating system, allowing you access to file names, etc.



In [44]:
import os 

Next, we will use the `os.path.join()` method to obtain a path to our data file. This method concatenates different path components (i.e. directories and a file name) into one file system path. We will save the results of this method to the variable name `filename`.

Now that we have a path to our CSV file, we will use the `pd.read_csv()` method to load the CSV file into a Pandas DataFrame named `dataFrame`.

Examine the code in the cell below and run the cell.

<b>Note</b>: the cell below may generate a warning. Ignore the warning. 

In [47]:
filename = os.path.join(os.getcwd(), "data", "airbnbData.csv") 
dataFrame = pd.read_csv(filename)



In [48]:
dataFrame.shape

(38277, 63)
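To see how `os.path.join()` and `pd.read_csv()` fit together without depending on the lab's `data` folder, the sketch below writes a tiny throwaway CSV file to a temporary directory and loads it back. The file name `demo.csv` and its contents are invented for this illustration.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Join the directory and file name into one path, using the right separator for the OS
    demo_path = os.path.join(tmp_dir, "demo.csv")

    # Write a two-row CSV file so there is something to load
    with open(demo_path, "w") as f:
        f.write("id,price\n1,100\n2,250\n")

    demo_df = pd.read_csv(demo_path)

print(os.path.basename(demo_path))
print(demo_df.shape)
```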

First, get a peek at the data:

In [49]:
dataFrame.head()

Unnamed: 0,id,scrape_id,last_scraped,host_id,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,2595,20211200000000.0,12/5/21,2845,9/9/08,within a day,80%,17%,f,Midtown,...,4.79,4.86,4.41,,f,3,3,0,0,0.33
1,3831,20211200000000.0,12/5/21,4869,12/7/08,a few days or more,9%,69%,f,Clinton Hill,...,4.8,4.71,4.64,,f,1,1,0,0,4.86
2,5121,20211200000000.0,12/5/21,7356,2/3/09,within an hour,100%,100%,f,Bedford-Stuyvesant,...,4.91,4.47,4.52,,f,2,0,2,0,0.52
3,5136,20211200000000.0,12/5/21,7378,2/3/09,within a day,100%,25%,f,Greenwood Heights,...,5.0,4.5,5.0,,f,1,1,0,0,0.02
4,5178,20211200000000.0,12/5/21,8967,3/3/09,within a day,100%,100%,f,Hell's Kitchen,...,4.42,4.87,4.36,,f,1,0,1,0,3.68


When using the `head()` method, you can specify the number of rows you would like to see by calling `head()` with an integer parameter (e.g. `head(2)`).

### b. Get column names: `df.columns`

Let us retrieve just the list of column names.

In [50]:
list(dataFrame.columns)

['id',
 'scrape_id',
 'last_scraped',
 'host_id',
 'host_since',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bathrooms_text',
 'bedrooms',
 'beds',
 'amenities',
 'price',
 'minimum_nights',
 'maximum_nights',
 'minimum_minimum_nights',
 'maximum_minimum_nights',
 'minimum_maximum_nights',
 'maximum_maximum_nights',
 'minimum_nights_avg_ntm',
 'maximum_nights_avg_ntm',
 'calendar_updated',
 'has_availability',
 'availability_30',
 'availability_60',
 'availability_90',
 'availability_365',
 'calendar_last_scraped',
 'number_of_reviews',
 'number_of_reviews_ltm',
 'number_of_reviews_l30d',
 'first_review',
 'last_review',


What do the column names mean? Some of them are less intuitively interpretable than others. <br>
Careful data documentation is indispensable for business analytics. You can consult the documentation that accompanies this open source dataset for a detailed description of the key variable names, what they represent, and how they were generated.

### c. Summary statistics of the DataFrame: `df.describe()`

Let's print some general statistics for each one of the `dataFrame` columns:

In [51]:
dataFrame.describe(include='all')

Unnamed: 0,id,scrape_id,last_scraped,host_id,host_since,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_neighbourhood,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,38277.0,38277.0,38277,38277.0,38243,21084,21084,21791,38243,30813,...,28165.0,28151.0,28150.0,1,38277,38277.0,38277.0,38277.0,38277.0,28773.0
unique,,,2,,4289,4,88,101,2,484,...,,,,1,2,,,,,
top,,,12/5/21,,10/29/19,within an hour,100%,100%,f,Bedford-Stuyvesant,...,,,,41662/AL,f,,,,,
freq,,,31879,,433,11151,13299,5342,30865,2138,...,,,,1,27851,,,,,
mean,29622390.0,20211200000000.0,,114830500.0,,,,,,,...,4.807454,4.750307,4.646892,,,17.747655,8.042637,9.593934,0.047966,1.721019
std,17422390.0,0.0,,129919400.0,,,,,,,...,0.465544,0.416101,0.518905,,,59.150451,34.977178,43.310123,0.426789,4.399826
min,2595.0,20211200000000.0,,2438.0,,,,,,,...,0.0,0.0,0.0,,,1.0,0.0,0.0,0.0,0.01
25%,13410480.0,20211200000000.0,,11394620.0,,,,,,,...,4.81,4.67,4.55,,,1.0,0.0,0.0,0.0,0.12
50%,30812690.0,20211200000000.0,,50052970.0,,,,,,,...,4.97,4.88,4.78,,,1.0,1.0,0.0,0.0,0.48
75%,46428550.0,20211200000000.0,,200239500.0,,,,,,,...,5.0,5.0,5.0,,,3.0,1.0,1.0,0.0,1.78


### d. Filtering the data: `df[ < condition > ]`

Consider the following business question: What is the average availability (out of 365 days in a year) for the listings in Brooklyn? <br>

The answer can be obtained by the use of **filters** on the dataset. We need to filter the entries that are in Brooklyn. To do this, we need to know the exact way that Brooklyn listings are spelled and entered in the data. Let's print all of the unique values of the `neighbourhood` column:

In [52]:
dataFrame['neighbourhood'].unique()

array(['New York, United States', 'Brooklyn, New York, United States',
       nan, 'Queens, New York, United States',
       'Long Island City, New York, United States',
       'Astoria, New York, United States',
       'Bronx, New York, United States',
       'Staten Island, New York, United States',
       'Elmhurst, New York, United States',
       'Riverdale , New York, United States',
       'Briarwood, New York, United States',
       'Kips Bay, New York, United States',
       'Jackson Heights, New York, United States',
       'New York, Manhattan, United States',
       'Park Slope, Brooklyn, New York, United States',
       'Kew Gardens, New York, United States',
       'Flushing, New York, United States',
       'Astoria , New York, United States',
       'Sunnyside, New York, United States',
       'Woodside, New York, United States',
       'NY , New York, United States',
       'Bushwick, Brooklyn, New York, United States',
       'Brooklyn , New York, United States', 'Uni

You may have noticed that there is a lot of heterogeneity in the way `neighbourhood` values are specified. The values are not standardized. There are overlaps, redundancies, and inconsistencies (e.g., some entries specify ```'Greenpoint, Brooklyn, New York, United States'```, others list `'BROOKLYN, New York, United States'`, and yet others say `'Williamsburg, Brooklyn, New York, United States'`). In real life, you would have to clean this data and replace these values with standard, identically formatted, consistent values. <br>

For this dataset, we are lucky to already have a 'cleansed' version of the neighborhood information based on the latitude and the longitude of every listing location. 

We will list the unique values of the columns titled `neighbourhood_cleansed` and `neighbourhood_group_cleansed`:

In [53]:
dataFrame['neighbourhood_cleansed'].unique()

array(['Midtown', 'Bedford-Stuyvesant', 'Sunset Park', 'Upper West Side',
       'South Slope', 'Williamsburg', 'East Harlem', 'Fort Greene',
       "Hell's Kitchen", 'East Village', 'Harlem', 'Flatbush',
       'Long Island City', 'Jamaica', 'Greenpoint', 'Nolita', 'Chelsea',
       'Upper East Side', 'Prospect Heights', 'Clinton Hill',
       'Washington Heights', 'Kips Bay', 'Bushwick', 'Carroll Gardens',
       'West Village', 'Park Slope', 'Prospect-Lefferts Gardens',
       'Lower East Side', 'East Flatbush', 'Boerum Hill', 'Sunnyside',
       'St. George', 'Tribeca', 'Highbridge', 'Ridgewood', 'Mott Haven',
       'Morningside Heights', 'Gowanus', 'Ditmars Steinway',
       'Middle Village', 'Brooklyn Heights', 'Flatiron District',
       'Windsor Terrace', 'Chinatown', 'Greenwich Village',
       'Clason Point', 'Crown Heights', 'Astoria', 'Kingsbridge',
       'Forest Hills', 'Murray Hill', 'University Heights', 'Gravesend',
       'Allerton', 'East New York', 'Stuyvesant Town

In [54]:
dataFrame['neighbourhood_group_cleansed'].unique()

array(['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx'],
      dtype=object)

Let's select only the data entries that pertain to Brooklyn listings:

In [55]:
bk = dataFrame[dataFrame['neighbourhood_group_cleansed'] == 'Brooklyn']
bk.shape

(14716, 63)

<b>Tip</b>: to better understand what happened above, in the code cell below, you are encouraged to copy *just the condition* of the filter that we used on the `dataFrame` object above: `dataFrame['neighbourhood_group_cleansed'] == 'Brooklyn'`. 

Run the cell and see what that condition alone evaluates to. You should see a Pandas Series containing True/False values. When we use that Series as a Boolean filter by writing `dataFrame[ < our Boolean series > ]`, i.e. `dataFrame[dataFrame['neighbourhood_group_cleansed'] == 'Brooklyn']`, we are telling Pandas to keep only the rows of `dataFrame` for which the condition evaluated to `True`. 

In [56]:
dataFrame['neighbourhood_group_cleansed'] == 'Brooklyn'

0        False
1         True
2         True
3         True
4        False
         ...  
38272    False
38273    False
38274    False
38275    False
38276     True
Name: neighbourhood_group_cleansed, Length: 38277, dtype: bool
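The same filtering pattern can be tried on a DataFrame small enough to check by hand. The sketch below uses a hypothetical four-row DataFrame named `listings` with invented values; it also shows how two conditions can be combined with `&`, where each condition needs its own parentheses.

```python
import pandas as pd

listings = pd.DataFrame({
    'neighbourhood_group_cleansed': ['Manhattan', 'Brooklyn', 'Brooklyn', 'Queens'],
    'availability_365': [300, 100, 50, 200],
})

# The condition alone is a Boolean Series with one True/False value per row
is_brooklyn = listings['neighbourhood_group_cleansed'] == 'Brooklyn'

# Using the Series as a filter keeps only the rows marked True
brooklyn = listings[is_brooklyn]

# Two conditions combined: Brooklyn listings available fewer than 80 days
rarely_available = listings[is_brooklyn & (listings['availability_365'] < 80)]

print(list(is_brooklyn))
print(brooklyn)
print(rarely_available)
```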


### e. Combining values in a column: `np.mean()`

Now that we isolated only the relevant entries, it remains to average the value of a particular column that we care about:

In [57]:
np.mean(bk['availability_365'])

118.7693666757271
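A Pandas column also has its own `mean()` method, so the NumPy function is not the only way to compute an average. The sketch below compares the two on a small hypothetical Series with invented values.

```python
import numpy as np
import pandas as pd

availability = pd.Series([100, 50, 300])  # hypothetical values

mean_np = np.mean(availability)   # NumPy function applied to the Series
mean_pd = availability.mean()     # equivalent Pandas method

print(mean_np, mean_pd)
```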

### f. Group data by (categorical) column values: `df.groupby()`

The next question of interest could be:<br>

What are the top 5 most reviewed neighborhoods in New York? (By sheer number of reviews, regardless of their quality). <br>

We will use the Pandas ```df.groupby()``` method to determine this:

In [None]:
nbhd_reviews = dataFrame.groupby('neighbourhood_cleansed')['number_of_reviews'].sum()
nbhd_reviews.head()

Perform a (descending order) sorting on this series:

In [None]:
nbhd_reviews = nbhd_reviews.sort_values(ascending = False)
nbhd_reviews.head(5)

What are the least reviewed neighborhoods?

In [None]:
nbhd_reviews.tail(5)

This result makes it apparent that our dataset is somewhat messy!

Notice we could have chained the transformations above into a single command, as in:

In [None]:
dataFrame.groupby('neighbourhood_cleansed')['number_of_reviews'].sum().sort_values(ascending = False).head(5)

This way we don't store objects that we won't need.
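To check what the chained `groupby()` command computes, it can help to run it on data you can add up by hand. The sketch below uses a hypothetical six-row DataFrame named `reviews` with invented review counts. It also shows `Series.nlargest()`, which is an alternative to sorting and then calling `head()`.

```python
import pandas as pd

reviews = pd.DataFrame({
    'neighbourhood_cleansed': ['Harlem', 'Midtown', 'Harlem', 'Chelsea', 'Midtown', 'Harlem'],
    'number_of_reviews': [10, 5, 20, 1, 7, 3],
})

# One total per neighbourhood: Harlem 33, Midtown 12, Chelsea 1
totals = reviews.groupby('neighbourhood_cleansed')['number_of_reviews'].sum()

# Two ways to obtain the most reviewed neighbourhoods
top_two = totals.sort_values(ascending=False).head(2)
top_two_alt = totals.nlargest(2)

print(top_two)
```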

### Bonus: Histogram plotting with Matplotlib: `plt.hist()`

As a final touch, run the cell below to visualize the distribution of the total number of reviews per neighborhood. <b>Note:</b> The cell may take a few seconds to run.

In [None]:
%matplotlib inline
nbhd_reviews.hist()

This plot suggests that the vast majority of neighborhoods have very few reviews, with just a handful of outliers (those ranked at the top of the sorted results above) having upward of 40000 reviews. 

## Part 2. ML Life Cycle: Business Understanding and Problem Formulation

In this part of the lab, you will practice the first step of the machine learning life cycle: business understanding and problem formulation.

Recall that the first step of the machine learning life cycle involves understanding and formulating your ML business problem, and the second step involves data understanding and preparation. In this lab however, we will first provide you with data and have you formulate a machine learning business problem based on that data.

We have provided you with four datasets that you will use to formulate a machine learning problem.

1. <b>HousingPrices.csv</b>: dataset that contains information about a house's characteristics (number of bedrooms, etc.) and its purchase price.

2. <b>Top100Restaurants2020.csv</b>: dataset that contains information about 100 top rated restaurants in 2020.

3. <b>ZooData.csv</b>: dataset that contains information about a variety of animals and their characteristics.

4. <b>FlightInformation.csv</b>: dataset that contains flight information.

The code cells below use the specified paths and names of the files to load the data into four different DataFrames.

<b>Task \#1</b>: After you run a code cell below to load the data, use some of the techniques you have practiced to inspect the data. Do the following: 

1. Inspect the first 10 rows of each DataFrame.
2. Inspect all of the column names in each DataFrame.
3. Obtain the shape of each DataFrame.

(Note: You can add more cells below to accomplish this task by going to the `Insert` Menu and clicking on `Insert Cell Below`. By default, the new code cell will be of type `Code`.)

<b>Task \#2</b>: Once you have an idea of what is contained in a dataset, you will formulate a machine learning problem for that dataset. This will be a predictive problem. For example, the Airbnb dataset you worked with above can be used to train a machine learning model that can predict the price of a new Airbnb. 

Come up with at least one machine learning problem per dataset. Specify what you would like to use the data to predict in the future. Since these will be supervised learning problems, specify whether it is a classification (binary or multiclass) or a regression problem. List the label and feature columns. 
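As one possible way to record a formulation in code, the sketch below separates a label column from the feature columns for the Airbnb price example mentioned above. The DataFrame `listings` and all of its values are invented for illustration; only the column names are taken from the Airbnb dataset.

```python
import pandas as pd

listings = pd.DataFrame({
    'accommodates': [2, 4, 1],
    'bedrooms': [1, 2, 1],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room'],
    'price': [150.0, 300.0, 80.0],  # invented values
})

# The label is what we want to predict; predicting a numeric price is a regression problem
label = 'price'
problem_type = 'regression'

# Every remaining column is a candidate feature
features = [col for col in listings.columns if col != label]

y = listings[label]
X = listings[features]

print(problem_type, label, features)
print(X.shape, y.shape)
```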

Note: Make sure you successfully ran the cell above that imports the `os` module prior to running the cells below.

<b>Housing Prices Dataset</b>:

In [58]:
filename1 = os.path.join(os.getcwd(), "data", "HousingPrices.csv") 

dataFrame1 = pd.read_csv(filename1)

Inspect the data:

In [59]:
print(dataFrame1.head(10))
print(dataFrame1.columns)
print(dataFrame1.shape)

      price   area  bedrooms  bathrooms  stories mainroad guestroom basement  \
0  13300000   7420         4          2        3      yes        no       no   
1  12250000   8960         4          4        4      yes        no       no   
2  12250000   9960         3          2        2      yes        no      yes   
3  12215000   7500         4          2        2      yes        no      yes   
4  11410000   7420         4          1        2      yes       yes      yes   
5  10850000   7500         3          3        1      yes        no      yes   
6  10150000   8580         4          3        4      yes        no       no   
7  10150000  16200         5          3        2      yes        no       no   
8   9870000   8100         4          1        2      yes       yes      yes   
9   9800000   5750         3          2        4      yes       yes       no   

  hotwaterheating airconditioning  parking prefarea furnishingstatus  
0              no             yes        2      

Formulate ML Business Problem:

<Double click this Markdown cell to make it editable, and record your problem formulation here.>

<b>Restaurants Dataset</b>:

In [61]:
filename2 = os.path.join(os.getcwd(), "data", "Top100Restaurants2020.csv") 

dataFrame2 = pd.read_csv(filename2)

Inspect the data:

In [62]:
print(dataFrame2.head(10))
print(dataFrame2.columns)
print(dataFrame2.shape)

   Rank                           Restaurant       Sales  Average Check  \
0     1             Carmine's (Times Square)  39080335.0             40   
1     2                The Boathouse Orlando  35218364.0             43   
2     3                     Old Ebbitt Grill  29104017.0             33   
3     4  LAVO Italian Restaurant & Nightclub  26916180.0             90   
4     5             Bryant Park Grill & Cafe  26900000.0             62   
5     6             Gibsons Bar & Steakhouse  25409952.0             80   
6     7       Top of the World at the STRAT   25233543.0            103   
7     8                          Maple & Ash  24837595.0             99   
8     9                            Balthazar  24547800.0             87   
9    10                    Smith & Wollensky  24501000.0            107   

         City State  Meals Served                    Category  
0    New York  N.Y.      469803.0               Italian/Pizza  
1    Orlando   Fla.      820819.0             

Formulate ML Business Problem:

<Double click this Markdown cell to make it editable, and record your problem formulation here.>

<b>Zoo Dataset</b>:

In [63]:
filename3 = os.path.join(os.getcwd(), "data", "ZooData.csv") 

dataFrame3 = pd.read_csv(filename3)

Inspect the data:

In [64]:
print(dataFrame3.head(10))
print(dataFrame3.columns)
print(dataFrame3.shape)

  animal_name   hair  feathers   eggs   milk  airborne  aquatic  predator  \
0    aardvark   True     False  False   True     False    False      True   
1    antelope   True     False  False   True     False    False     False   
2        bass  False     False   True  False     False     True      True   
3        bear   True     False  False   True     False    False      True   
4        boar   True     False  False   True     False    False      True   
5     buffalo   True     False  False   True     False    False     False   
6        calf   True     False  False   True     False    False     False   
7        carp  False     False   True  False     False     True     False   
8     catfish  False     False   True  False     False     True      True   
9        cavy   True     False  False   True     False    False     False   

   toothed  backbone  breathes  venomous   fins  legs   tail  domestic  \
0     True      True      True     False  False     4  False     False   
1   

Formulate ML Business Problem:

<Double click this Markdown cell to make it editable, and record your problem formulation here.>

<b>Flight Dataset</b>:

In [65]:
filename4 = os.path.join(os.getcwd(), "data", "FlightInformation.csv") 

dataFrame4 = pd.read_csv(filename4)

Inspect the data:

In [66]:
print(dataFrame4.head(10))
print(dataFrame4.columns)
print(dataFrame4.shape)

   id Airline  Flight AirportFrom AirportTo  DayOfWeek  Time  Length  Delay
0   1      CO     269         SFO       IAH          3    15     205      1
1   2      US    1558         PHX       CLT          3    15     222      1
2   3      AA    2400         LAX       DFW          3    20     165      1
3   4      AA    2466         SFO       DFW          3    20     195      1
4   5      AS     108         ANC       SEA          3    30     202      0
5   6      CO    1094         LAX       IAH          3    30     181      1
6   7      DL    1768         LAX       MSP          3    30     220      0
7   8      DL    2722         PHX       DTW          3    30     228      0
8   9      DL    2606         SFO       MSP          3    35     216      1
9  10      AA    2538         LAS       ORD          3    40     200      1
Index(['id', 'Airline', 'Flight', 'AirportFrom', 'AirportTo', 'DayOfWeek',
       'Time', 'Length', 'Delay'],
      dtype='object')
(65499, 9)


Formulate ML Business Problem:

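1. <b>Housing Prices</b>: Predict the sale price of a house based on its characteristics. This is a regression problem. The label is `price`; the features are the remaining columns, such as `area`, `bedrooms`, `bathrooms`, and `stories`.

2. <b>Top 100 Restaurants</b>: Predict whether a restaurant will be ranked within the top 100 next year based on its current attributes, such as `Sales`, `Average Check`, and `Meals Served`. This is a binary classification problem.

3. <b>Zoo Data</b>: Predict the type of an animal based on its physical and behavioral traits, such as `hair`, `feathers`, `eggs`, and `milk`. This is a multiclass classification problem.

4. <b>Flight Information</b>: Predict whether a flight will be delayed or on time. This is a binary classification problem. The label is `Delay`; the features include `Airline`, `AirportFrom`, `AirportTo`, `DayOfWeek`, `Time`, and `Length`.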
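One way to sanity-check a classification formulation is to count the distinct values in the label column. The sketch below does this for the flight data, using only the ten rows printed in the output above (some columns are omitted for brevity), so the counts describe that sample and not the full dataset. The names `flights`, `n_classes`, and `problem_type` are illustrative.

```python
import pandas as pd

# The first ten rows of the flight data, copied from the output above
flights = pd.DataFrame({
    "Airline": ["CO", "US", "AA", "AA", "AS", "CO", "DL", "DL", "DL", "AA"],
    "DayOfWeek": [3] * 10,
    "Time": [15, 15, 20, 20, 30, 30, 30, 30, 35, 40],
    "Length": [205, 222, 165, 195, 202, 181, 220, 228, 216, 200],
    "Delay": [1, 1, 1, 1, 0, 1, 0, 0, 1, 1],
})

label = "Delay"

# Two distinct label values suggest binary classification; more suggest multiclass
n_classes = flights[label].nunique()
problem_type = "binary classification" if n_classes == 2 else "multiclass classification"

# How many examples fall into each class
counts = flights[label].value_counts()

print(problem_type)
print(counts)
```

The same check on the full DataFrame would be `dataFrame4["Delay"].value_counts()`.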

<b>Next Steps</b>: The second step of the machine learning life cycle is data understanding and data preparation. You practiced some aspects of data understanding when using NumPy and Pandas to inspect the Airbnb dataset. You will learn more about this second step of the machine learning life cycle in the next unit.