# Lab 2 - Python and Pandas 

Upon successful completion of this assignment, a student will be able to:

* Gain experience in formatting text using Markdown
* Load in a data set, access it, and explore its properties.
* Submit assignment to Gradescope.

We start with the standard setup for our notebook files: importing the standard modules.

In [None]:
#  Import standard modules  
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

import otter
grader = otter.Notebook()

## Example 1 - More Data Cleaning 
*Adapted from J. Sullivan*

Let's look at another data file to see additional data cleaning steps and code.  

The initial data set reads in part: 

![property data](images/property-data.jpg)

In [None]:
prop = pd.read_csv("data/property.csv")
prop

We can see that `pandas` is already able to find some of the different ways that we have missing values in the data.

For instance, in the `ST_NUM` column, the 3rd entry is blank and the 7th entry is NaN.  `pandas` filled in the blank entry with `NaN` as well.  Both of these values are found by the `isnull()` method.

In [None]:
prop['ST_NUM'].isnull()

However, there are other missing value encodings that `pandas` does not immediately recognize. 

Let's look at the `NUM_BEDROOMS` column. 

![property data 2](images/property-data2.jpg)




In this column, we have missing values as "n/a", "NA", "--" and "na".

Let's see what `pandas` automatically recognizes.

In [None]:
prop['NUM_BEDROOMS'].isnull()

`pandas` automatically recognizes the "n/a" and "NA" but not the "--" and "na". 

Let's change that! 

In [None]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--", "NA"]
prop2 = pd.read_csv("data/property.csv", na_values = missing_values)

In [None]:
print(prop2['NUM_BEDROOMS'])
print(prop2['NUM_BEDROOMS'].isnull())
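
To see the effect of `na_values` without the lab's data file, here is a minimal sketch that reads a small in-memory CSV. The column names `ID` and `BEDS` and the values are made up for illustration; only `pd.read_csv`, `na_values`, and `isnull()` come from the cells above.

```python
import io
import pandas as pd

# Hypothetical stand-in for a column like NUM_BEDROOMS
csv_text = "ID,BEDS\n1,3\n2,n/a\n3,NA\n4,--\n5,na\n"

# Default behavior: only the encodings pandas already knows become NaN
default_read = pd.read_csv(io.StringIO(csv_text))

# Extra encodings are added on top of the defaults
custom_read = pd.read_csv(io.StringIO(csv_text), na_values=["--", "na"])

default_missing = int(default_read["BEDS"].isnull().sum())
custom_missing = int(custom_read["BEDS"].isnull().sum())
print(default_missing, custom_missing)  # prints: 2 4
```

Note that the values passed to `na_values` are added to the default list rather than replacing it, which is why "n/a" and "NA" are still caught in the second read.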

## Example 2 - Printing


In many courses and tutorials for new languages, the first thing you learn is printing "Hello World".

In [None]:
print('Hello World')

In [None]:
firstName = "Fill in name"

In [None]:
"Hello " + firstName + "!"

Apply the built-in function `dir()` to the variable `firstName` above and print the outcome.

https://docs.python.org/3/library/functions.html#dir

In [None]:
dir(firstName)

This lists all the attributes and methods available on the string `firstName`.
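
The full `dir()` output includes many names that start with underscores. As a rough sketch, the list can be filtered down to the public names; the variable `sample` and its value are hypothetical and used only for illustration.

```python
sample = "data science"  # hypothetical example string

# Keep only the public names (those without a leading underscore)
public_names = [name for name in dir(sample) if not name.startswith("_")]

print(public_names[:5])
print("split" in public_names, "strip" in public_names)
```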

## Q1 - Strings

I want you to explore using the built-in function `len()` and the string methods `split()` and `strip()` on the following string. 

https://docs.python.org/3/library/functions.html

In [None]:
className = " Introduction    to   Data Science   "

In [None]:
# Show how to find the length of the string "className" 
# Store the results in a new variable "class_length"
class_length = ...
class_length

In [None]:
# Show the results of the `split()` function on the string "className"  
# Store the results in a new variable "class_split"
class_split = ...
class_split

In [None]:
# Save the results of the `strip()` function on the string "className" in a 
# new variable "className2"
className2 = ...
className2

In [None]:
grader.check("q1")

## Example 3 - Comments 

To create a comment line (in line with the code), use the # (hash) symbol followed by a space. (Shortcut: `Ctrl + /`) [To uncomment, remove the # or use `Ctrl + /` again.]

Another option is to enclose the text in triple quotes (`"""` or `'''`). Strictly speaking, this creates a string literal rather than a comment, but it is commonly used for multi-line comment blocks. (It needs to be on its own lines, separate from the code.) Different programming languages have different approaches to commenting, so please be aware.

In [None]:
# This is a comment

In [None]:
'''This is a larger comment block 
that may span multiple lines 
'''
2 + 2
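
One way to see the difference between the two styles is inside a function. In this small sketch (the function `add` is hypothetical), the triple-quoted string placed first in the body is kept by Python as the function's docstring, while the `#` comment is discarded.

```python
def add(a, b):
    """Return the sum of a and b."""
    # This is a real comment: Python discards it entirely
    return a + b

result = add(2, 2)
print(result)
print(add.__doc__)  # the triple-quoted string is stored on the function
```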

<!-- BEGIN QUESTION -->

## Q2 - Markdown

The Markdown option for cells in a Jupyter notebook provides a way to display information to the user around particular code snippets. For more information and reading, please look into:

https://help.github.com/articles/basic-writing-and-formatting-syntax/

https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html

Colab's Markdown Guide: https://colab.research.google.com/notebooks/markdown_guide.ipynb#scrollTo=5Y3CStVkLxqt

For this exercise add a new 'Text' cell and try to recreate the following block of text below the two lines. 
<hr><hr>

![example markdown](images/markdown-example.png)




*Enter your Markdown here* to recreate the formatted text above. 

<!-- END QUESTION -->

## Example 4 - String Operations 

Here you can see some more operations working with strings.

https://docs.python.org/3/library/stdtypes.html#str

In [None]:
greeting = "Hello Data Science 2024"  # named `greeting` so the built-in `str` is not shadowed

In [None]:
print(greeting.find("2024"))

In [None]:
print(greeting[-4:])

In [None]:
greeting.upper()

In [None]:
greeting.lower()

In [None]:
greeting + ' & ' + 'FutureDataScientist'
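
The cells above do not show what `find()` returns when the substring is absent. This short sketch, using a hypothetical variable `message`, shows that case and how the returned index can be combined with slicing.

```python
message = "Hello Data Science 2024"

start = message.find("2024")    # index where the substring begins
missing = message.find("2025")  # -1 signals that the substring was not found

# Slicing from the found index gives the same result as message[-4:]
year = message[start:]
print(start, missing, year)
```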

<br/><br/>
<hr style="border: 5px solid #8a8c8c;" />
<hr style="border: 1px solid #ffcd00;" />

# Pandas Review 

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will review commonly used data-wrangling operations/tools in `pandas`. We aim to give you familiarity with:

* Creating `DataFrames`
* Slicing `DataFrames` (i.e., selecting rows and columns)
* Filtering data (using boolean arrays)

In this lab, you are going to use several `pandas` methods. As a reminder from lecture, you may press `shift+tab` on method parameters to see the documentation for that method. For example, if you were using the `drop` method in `pandas`, you could press `shift+tab` to see what `drop` is expecting.

**Note**: The `pandas` interface is notoriously confusing for beginners, and the documentation is not consistently great. Throughout the semester, you will have to search through [`pandas` documentation](https://pandas.pydata.org/docs/reference/index.html) and experiment, but remember it is part of the learning experience and will help shape you as a data scientist!




---


### **REVIEW:** Creating `DataFrames` & Basic Manipulations

Recall that a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe) is a table in which each column has a specific data type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

Usually, you'll create `DataFrames` by using a function like `pd.read_csv`. However, in this section, we'll discuss how to create them from scratch.

The [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) for the `pandas` `DataFrame` class describes several ways to construct a `DataFrame`.

**Syntax 1:** You can create a `DataFrame` by specifying the columns and values using a dictionary, as shown below. 

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [None]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink'],
          'price': [1.0, 0.75, 0.35, 0.05]
          })
fruit_info

**Syntax 2:** You can also define a `DataFrame` by specifying the rows as shown below. 

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [None]:
fruit_info2 = pd.DataFrame(
    [("red", "apple", 1.0), ("orange", "orange", 0.75), ("yellow", "banana", 0.35),
     ("pink", "raspberry", 0.05)], 
    columns = ["color", "fruit", "price"])
fruit_info2
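
As a quick check that the two syntaxes describe the same kind of table, the sketch below builds one small table both ways and compares the results. The table contents (`item`, `cost`) are hypothetical; `DataFrame.equals` is used here only as a convenient comparison.

```python
import pandas as pd

# Hypothetical two-column table, built column by column (Syntax 1)
from_columns = pd.DataFrame({"item": ["pen", "book"], "cost": [1.5, 12.0]})

# The same table, built row by row (Syntax 2)
from_rows = pd.DataFrame([("pen", 1.5), ("book", 12.0)], columns=["item", "cost"])

same_table = from_columns.equals(from_rows)
print(same_table)
```

The comparison only succeeds when the column order matches, so `fruit_info` and `fruit_info2` above, which list their columns in different orders, would not compare as equal without reordering.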

You can also convert the entire `DataFrame` into a two-dimensional `NumPy` array. Remember that a `NumPy` array holds homogeneous data, whereas a `DataFrame` can contain heterogeneous data. 

In [None]:
numbers = pd.DataFrame({"A":[1, 2, 3], "B":[0, 1, 1]})
numpy_numbers = numbers.to_numpy()

print(type(numpy_numbers))
print(numpy_numbers)

The `values` attribute also returns the content of the `DataFrame` as a two-dimensional `NumPy` array.

In [None]:
fruit_info.values
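
To illustrate the homogeneous-versus-heterogeneous point, here is a small sketch comparing the array produced from an all-numeric table with one produced from a table that mixes text and numbers. The frames `numeric` and `mixed` are made up for this example.

```python
import pandas as pd

numeric = pd.DataFrame({"A": [1, 2, 3], "B": [0, 1, 1]})
mixed = pd.DataFrame({"fruit": ["apple", "banana"], "price": [1.0, 0.35]})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()

# An all-integer table keeps an integer dtype
print(type(numeric_array).__name__, numeric_array.dtype.kind)

# Mixed columns fall back to the generic `object` dtype so one array can hold both
print(mixed_array.dtype)
```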

## **REVIEW:** Selecting Rows and Columns in `pandas`

As you've seen in lecture, there are two verbose operators in `pandas` for selecting rows and columns: `loc` and `iloc`. Let's review them briefly.

### **Approach 1:** `loc`

The first of the two verbose operators is `loc`, which takes two arguments. The first is one or more **row labels**, the second is one or more **column labels** - both of which are displayed in bold to the left of each of the rows and above each of the columns, respectively. These are not the same as positional indices, which are used for indexing Python lists or `NumPy` arrays!

The desired rows or columns can be provided individually, in slice notation, or as a list. Some examples are given below.

Note that **slicing in `loc` is inclusive** on the provided labels.

In [None]:
# Get rows 0 through 2 (inclusive) with labels 'fruit' through 'price' 
#  (which would include the color column that is in between both labels)
fruit_info.loc[0:2, 'fruit':'price']

In [None]:
# Get rows 0 through 2 (inclusive) and columns 'fruit' and 'price'. 
# Note the difference in notation and result from the previous example.
fruit_info.loc[0:2, ['fruit', 'price']]

In [None]:
# Get rows 0 and 2 and columns fruit and price. 
fruit_info.loc[[0, 2], ['fruit', 'price']]

In [None]:
# Get rows 0 and 2 and column fruit
fruit_info.loc[[0, 2], ['fruit']]

Note that if we request a single column but don't enclose it in a list, the return type of the `loc` operator is a `Series` rather than a `DataFrame`. 

In [None]:
# Get rows 0 and 2 and column fruit, returning the result as a Series
fruit_info.loc[[0, 2], 'fruit']

If we provide only one argument to `loc`, it uses the provided argument to select rows, and returns all columns.

In [None]:
fruit_info.loc[0:1]

Note that if you try to access columns without providing rows, `loc` will raise a `KeyError`, because it looks for those names among the row labels. 

In [None]:
# This line raises a KeyError if uncommented
#fruit_info.loc[["fruit", "price"]]

# This works fine: select all rows, then the columns by label
fruit_info.loc[:, ["fruit", "price"]]
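
Because the row labels in `fruit_info` happen to be integers, it can be hard to see that `loc` works on labels rather than positions. The sketch below assumes a hypothetical variant of the table, `fruit_prices`, whose row labels are the fruit names, which makes the label-based, inclusive slicing easier to see.

```python
import pandas as pd

fruit_prices = pd.DataFrame(
    {"color": ["red", "orange", "yellow", "pink"],
     "price": [1.0, 0.75, 0.35, 0.05]},
    index=["apple", "orange", "banana", "raspberry"],  # hypothetical string row labels
)

# Slice from the label "orange" through the label "banana", both included
middle = fruit_prices.loc["orange":"banana", "price"]
print(list(middle.index))
print(list(middle))
```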

<br/>

---

### **Approach 2:** `iloc`

`iloc` is very similar to `loc` except that its arguments are **row numbers** and **column numbers**, rather than row and column labels. A useful mnemonic is that the `i` stands for "integer". This is quite similar to indexing into a Python `list` or `NumPy` array.

In addition, **slicing for `iloc` is exclusive** on the provided integer indices. Some examples are given below:

In [None]:
# Get rows 0 through 3 (exclusive) and columns 0 through 3 (exclusive)
fruit_info.iloc[0:3, 0:3]

In [None]:
# Get rows 0 through 3 (exclusive) and columns 0 and 2.
fruit_info.iloc[0:3, [0, 2]]

In [None]:
# Get rows 0 and 2 and columns 0 and 2.
fruit_info.iloc[[0, 2], [0, 2]]

In [None]:
# Get rows 0 and 2 and column 0 (fruit) as a DataFrame
fruit_info.iloc[[0, 2], [0]]

In [None]:
# Get rows 0 and 2 and column 0 (fruit), returning the result as a Series
fruit_info.iloc[[0, 2], 0]

Note that in these `loc` and `iloc` examples above, the row **label** and row **number** were always the same.

Let's see an example where they are different. If we sort our fruits by price, we get:

In [None]:
fruit_info_sorted = fruit_info.sort_values("price")
fruit_info_sorted

After sorting, note how row number 0 now has index label 3, row number 1 now has index label 2, etc. These indices are the arbitrary numerical indices generated when we created the `DataFrame`. For example, `banana` was originally in row 2, and so it has row label 2. Note the distinction between the index _label_, and the actual index _position_.

If we request the rows in positions 0 and 2 using `iloc`, we're indexing using the row NUMBERS, not labels. 

In [None]:
fruit_info_sorted.iloc[[0, 2], 0]

Lastly, similar to `loc`, the second argument to `iloc` is optional. That is, if you provide only one argument to `iloc`, it treats the argument you provide as a set of desired row numbers, not column numbers.

In [None]:
fruit_info_sorted.iloc[[0, 2]]
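
The label-versus-position distinction can be checked directly. This sketch rebuilds a two-column version of the fruit table (named `fruits` here to keep the example self-contained) and asks for "row 0" both ways. The call to `reset_index(drop=True)` is not used elsewhere in this lab; it is shown only as one way to renumber the labels after sorting.

```python
import pandas as pd

fruits = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})
by_price = fruits.sort_values("price")

by_position = by_price.iloc[0, 0]    # first row as displayed after sorting
by_label = by_price.loc[0, "fruit"]  # the row whose label is 0

# Renumber the labels so that labels and positions agree again
renumbered = by_price.reset_index(drop=True)
after_reset = renumbered.loc[0, "fruit"]

print(by_position, by_label, after_reset)  # prints: raspberry apple raspberry
```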

<br>

---

### **Approach 3:** `[]` Notation

`pandas` also supports the `[]` operator. It's similar to `loc` in that it lets you access rows and columns.

However, unlike `loc`, which takes row labels and also optionally column labels, `[]` is more flexible. If you provide it with a row slice, it'll give you rows (though an integer slice is positional and excludes its endpoint, like `iloc`), and if you provide it with only column names, it'll give you columns (whereas `loc` will raise an error).

Some examples:

In [None]:
fruit_info[0:2]

In [None]:
# Here we're providing a list of column names as a single argument to []
fruit_info[["fruit", "color", "price"]]

Note that slicing notation is not supported for columns if you use `[]` notation. Use `loc` instead.

In [None]:
# This line raises an error if uncommented
#fruit_info["fruit":"price"]

# This works fine
fruit_info.loc[:, "fruit":"price"]
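
Two details of `[]` are easy to miss: an integer slice selects rows by position and excludes the endpoint, and a single column name returns a `Series` while a list of names returns a `DataFrame`. The sketch below checks both on a small copy of the fruit table, named `fruits` here so the example is self-contained.

```python
import pandas as pd

fruits = pd.DataFrame({
    "fruit": ["apple", "orange", "banana", "raspberry"],
    "price": [1.0, 0.75, 0.35, 0.05],
})

rows_bracket = fruits[0:2]   # positional slice: rows 0 and 1
rows_loc = fruits.loc[0:2]   # label slice: rows 0, 1, and 2

one_column = fruits["fruit"]          # single name gives a Series
one_column_frame = fruits[["fruit"]]  # list of names gives a DataFrame

print(len(rows_bracket), len(rows_loc))
print(type(one_column).__name__, type(one_column_frame).__name__)
```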

`[]` and `loc` are quite similar. For example, the following two pieces of code are functionally equivalent for selecting the fruit and price columns.

1. `fruit_info[["fruit", "price"]]` 
2. `fruit_info.loc[:, ["fruit", "price"]]`

Because it yields more concise code, you'll find that our code and your code both tend to feature `[]`. However, there are some subtle pitfalls of using `[]`. If you're ever having performance issues, weird behavior, or you see a `SettingWithCopyWarning` in `pandas`, switch from `[]` to `loc`, and this may help.

To avoid getting too bogged down in indexing syntax, we'll avoid a more thorough discussion of `[]` and `loc`. We may return to this at a later point in the course.

For more on `[]` vs. `loc`, you may optionally try reading:
1. https://stackoverflow.com/questions/48409128/what-is-the-difference-between-using-loc-and-using-just-square-brackets-to-filte
2. https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc/65875826#65875826
3. https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas/53954986#53954986

## Q3 - Pandas 

Pandas Resources:
* https://pandas.pydata.org/
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

We are going to be using the Abalone data set, which is part of the UCI Machine Learning Repository, a common place to find data sets for testing code and for learning about machine learning and data science. 

I have already downloaded the data from https://archive.ics.uci.edu/dataset/1/abalone
 
In the next cell, you will modify the code to read in the `abalone.data` file properly.  Use the following names for the columns:  
`sex`, `len`, `diam`, `hgt`, `wh_wgt`, `shuck_wgt`, `vis_wgt`, `sh_wgt`, `rings`

*HINT:* You will need to use additional parameters of the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. It will be helpful to look at the documentation on `read_csv`:   
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html



In [None]:
df = pd.read_csv(...)  # modify this code to properly read the data
# specify the column names using the information above as an argument to the 
# read_csv function
df.head()

In [None]:
grader.check("q3")

## Q4 - Pandas 

Here you will explore properties of the DataFrame and its attributes.

In [None]:
# Determine the number of rows and columns of the data set 
rows = ...
columns = ...

# Determine the column names 
dfColumnNames = ... 

print(f'No of rows: {rows}')
print(f'No of columns: {columns}')
dfColumnNames

In [None]:
grader.check("q4")

## Q5 - Pandas 

Show the first 4 rows of the DataFrame.

Show the last 7 rows of the DataFrame.

In [None]:
first_4_rows = ...
last_7_rows = ... 
print(first_4_rows)
print(last_7_rows)

In [None]:
grader.check("q5")

## Q6 - Pandas 

Practice selecting different parts of the DataFrame

Select the `sh_wgt` column

Then, select both the `diam` and `vis_wgt` columns and only the first eight rows.

In [None]:
# select just the sh_wgt column 
shell_wgt = ...
shell_wgt

In [None]:
diam_vis_wgt = ...
diam_vis_wgt

In [None]:
grader.check("q6")

## Q7 - Pandas 

Select the following using the `.iloc` indexer: 
* `index_6` - row with index=5, the 6th row, of the DataFrame 
* `row_5_7` - the 5th and 7th row of the DataFrame, and 
* `ansC` - every other row and every third column starting from the 2nd row and 3rd column



In [None]:
index_6 = ...
index_6

In [None]:
df

In [None]:
row_5_7 = ...
row_5_7

In [None]:
ansC = ...
ansC

In [None]:
grader.check("q7")

## Q8 - Data Selection and Statistics 

Compute the `mean()`, `max()`, and `min()` of the first 12 data points for all the weight columns.

*Hint: remember df.head(10) returns the first 10 rows of the DataFrame*

In [None]:
meanVals =  ...
meanVals

In [None]:
maxVals = ...
maxVals

In [None]:
minVals = ...
minVals

In [None]:
grader.check("q8")

## Q9 - Data Selection and Statistics 

Group by column "sex" and find the median for the other variables. 

In [None]:
group =  df ...
group

In [None]:
grader.check("q9")

## Q10 - Data Selection and Statistics 

Find the mean weights of abalone with more than 12 rings. 

In [None]:
mean_vals = ...
mean_vals

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)