In [None]:
import numpy as np
import pandas as pd

## Comparison UFuncs as Array Masks

Here is a brief review of what we've covered in the last lecture:

1. We saw how the comparison ufuncs (`np.equal`, `np.less`, `np.greater`, etc.) generate boolean arrays that indicate whether each element of an array meets (or doesn't meet) the condition of the function.

1. We then showed how you could pass these boolean arrays to `np.sum`, `np.all`, and `np.any` to derive additional information on your data set.

1. Finally, we demonstrated how you could logically compare two boolean arrays with the **bitwise** operators to perform multistep data comparisons.
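As a quick recap, the three points above can be sketched in a few lines. The `scores` array is made-up sample data for illustration only, not something from the previous lecture.

```python
import numpy as np

# Hypothetical sample data, used only for this recap
scores = np.array([55, 72, 90, 64, 88])

# 1. A comparison ufunc produces a boolean array (same as scores >= 65)
passed = np.greater_equal(scores, 65)
print(passed)            # [False  True  True False  True]

# 2. Boolean arrays can be summarized with np.sum, np.any, and np.all
print(np.sum(passed))    # 3  (each True counts as 1)
print(np.any(passed))    # True
print(np.all(passed))    # False

# 3. Two boolean arrays can be combined with the bitwise & operator
in_band = (scores >= 65) & (scores < 90)
print(in_band)           # [False  True False False  True]
```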

For the last segment of this tutorial, we are going to demonstrate using comparison functions to return the original items of the array that is being evaluated instead of a boolean array.

#### Array Masking
In the last lecture, we showed how you could select data from an array using index or slice notation. Here we will introduce another data selection technique called **masking**.

Basically, it looks a lot like slice notation. In case you've forgotten what that looks like, here is a reminder.

In [None]:
simple_int_array = np.array([5, 3, 4, 9, 8, 2, 1, 7, 6, 0])
simple_int_array

In [None]:
# Slice elements indexed 5, 6, 7 of our simple_int_array
simple_int_array[5:8]

The difference with a mask is that instead of putting `[start:stop:step]` inside the brackets, you put a comparison expression there.

In [None]:
simple_int_array < 7

In [None]:
# Let's return all the values of simple_int_array that are less than 7
simple_int_array[simple_int_array < 7]

**This way of masking is very important and used a lot.** The above statement works the following way:
1. The comparison UFunc inside the brackets is evaluated first.
1. It returns a boolean array that is `True` at every position where the element is less than 7 (seven of the ten positions here) and `False` everywhere else.
1. For each index of the boolean array with a `True` value, the corresponding element of the original array is returned.
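To make these steps visible, here is a small sketch that stores the intermediate boolean array in a variable before using it as a mask. The name `mask` is just an illustrative choice.

```python
import numpy as np

simple_int_array = np.array([5, 3, 4, 9, 8, 2, 1, 7, 6, 0])

# Steps 1 and 2: evaluate the comparison and keep the boolean array
mask = simple_int_array < 7
print(mask)
# [ True  True  True False False  True  True False  True  True]

# Step 3: keep only the elements where the mask is True
print(simple_int_array[mask])
# [5 3 4 2 1 6 0]
```

Note that the `True` values are scattered through the mask: they follow the positions of the small values, not the first positions of the array.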

In [None]:
# At this point you don't have to know the details of the following data loading.
# However, understand that it loads the weights, names, and heights of all the athletes.
# The CSV file is read once and each column is converted to a NumPy array.
roster = pd.read_csv('./data/nd-football-2021-roster.csv')
nd_player_weights = np.array(roster['Weight'])
nd_player_names = np.array(roster['Name'])
nd_player_heights = np.array(roster['Height'])

## Activity:

* Names of all players above 75 inches? 
* Names of all players above 220 lbs and below 250 lbs? 

**Hint**: You can use a boolean array created from one array as a mask on another array.
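Here is a minimal sketch of that hint on unrelated, made-up data (the city names and temperatures are hypothetical). It only works because the two arrays are aligned position by position.

```python
import numpy as np

# Hypothetical data: element i of each array describes the same city
city_names = np.array(['Oslo', 'Cairo', 'Lima', 'Delhi'])
city_temps = np.array([4, 35, 19, 41])

# The mask is built from one array and applied to the other
hot = city_temps > 30
print(city_names[hot])         # ['Cairo' 'Delhi']

# Two conditions are combined with the bitwise & operator
mild = (city_temps > 10) & (city_temps < 30)
print(city_names[mild])        # ['Lima']
```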

## np.unique

Returns the sorted unique elements of an array. [More info](https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html)

In [None]:
sample_array = np.array([1,2,2,1,2,3,2,23,2,1,3,2])
np.unique(sample_array)

# NumPy: Fancy Indexing

In [None]:
simple_array = np.array([10,20,30,40,50,60])
simple_array

In [None]:
simple_array[3]

In [None]:
simple_array[[5,0]]
# This is equivalent to np.array([simple_array[5], simple_array[0]])

In [None]:
np.array([simple_array[5], simple_array[0]])

## Application: np.argsort()

Returns the indices that would sort an array. [More Info](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html)



In [None]:
x = np.array(['d', 'a', 'b', 'x'])
np.argsort(x)

In [None]:
x[np.argsort(x)]

# NumPy: Broadcasting

## Case 1

In [None]:
x = np.arange(3)
y = 5
print(x)
print(y)
print(x.shape)
print(x+y)

## Case 2

In [None]:
x = np.random.randint(10, size =((3,3)))
y = np.random.randint(10, size = 3)
print("x array")
print(x)
print("y array")
print(y)

print("Their shapes are respectively")
print(x.shape)
print(y.shape)

In [None]:
x - y

## Case 3

In [None]:
x = np.random.randint(10, size=(3,1))
y = np.random.randint(10, size = 3)
print("x  array")
print(x)
print("y array")
print(y)

print("Their shapes are respectively")
print(x.shape)
print(y.shape)

In [None]:
x - y
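
Because the cells above use random numbers, the output changes on every run. The sketch below repeats Case 3 with fixed, made-up values so the result can be checked by hand: the `(3, 1)` column is stretched across columns and the `(3,)` row is stretched across rows, giving a `(3, 3)` result.

```python
import numpy as np

col = np.array([[10], [20], [30]])   # shape (3, 1)
row = np.array([1, 2, 3])            # shape (3,)

# Every element of col is paired with every element of row
result = col - row
print(result.shape)   # (3, 3)
print(result)
# [[ 9  8  7]
#  [19 18 17]
#  [29 28 27]]
```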

# Detour to Dictionary

In [None]:
my_dict = dict()
print(my_dict)

In [None]:
my_dict['Jan'] = 1
my_dict['Feb'] = 2
my_dict['Mar'] = 3

my_dict

In [None]:
my_dict['Jan']

In [None]:
# You can update the dictionary through the key
my_dict['Jan'] = my_dict['Jan'] + 4

In [None]:
my_dict

In [None]:
# Accessing a key that is not in the dict raises a KeyError
my_dict['Dec']

In [None]:
# To check if a key is in the dict
'Dec' in my_dict

In [None]:
'Mar' in my_dict

## You can access the keys and values separately

In [None]:
print(my_dict.keys())
print(my_dict.values())

list(my_dict.keys())

In [None]:
np.array(list(my_dict.keys()))

# Introduction to Pandas

<div class="alert alert-block alert-info">
<p> Source: Example datasets and discussion in this Jupyter Notebook were partly sourced from `Mike Dunn`, University of Notre Dame.  </p>
</div>

In [None]:
import pandas as pd
pd.__version__

## `DataFrame` & `Series` Basics

### Basics on data loading

<div class="alert alert-block alert-info">
<h5>Know your current working directory</h5>

<p>`import os`</p>
<p>`os.getcwd()`</p>

</div>

In [None]:
import os
print(os.getcwd())

<div class="alert alert-block alert-danger">
<h5>Make sure the data is in the right place. </h5>
<p> </p>
<li> Open the folder location printed by the `os.getcwd()` command above, using File Explorer on Windows (or Finder on Mac)</li>
<li> Make sure there is a folder named 'data' in the location you opened. If not, create one</li>
<li> Copy the downloaded data into the 'data' folder </li>

</div>

In [None]:
print(os.listdir('./data/'))

When you specify './data/' in the Python commands above and below, '.' means the current directory.

The following statement therefore means: in the current directory (the one this Jupyter file runs from), open the 'data' folder and look for the 'nd-football-2021-roster.csv' file.

In [None]:
athletes_data = pd.read_csv('./data/nd-football-2021-roster.csv')
type(athletes_data)

<div class="alert alert-block alert-info">
The `type` function returns the type of the object passed to it. Very handy!
</div> 

### `DataFrames` are made up of an `index` and one or more `Series`
Inside of every frame is an **`index`** and one or more **`Series`** objects.
Let's demonstrate this by looking at the first few elements of our `athletes_data` object.

In [None]:
# athletes_data.head() returns the first few rows of the DataFrame (five by default)

athletes_data.head()

The bold numbers running down the left-hand side are the **`index`** of the **`DataFrame`**.  The bold strings running across the top are the names of the nested **`Series`** objects.

In [None]:
name_series = athletes_data['Name']
type(name_series)

<div class="alert alert-block alert-info">
<h5>Dictionary Like-Retrieval</h5>
<p>Did you see how I passed the name of the **`Series`** object that I wanted to the `athletes_data` frame? It was the same sort of syntax you'd use to retrieve a data element from a **`dict`**.</p>
<p>
As we continue to move along, we'll discover that **`DataFrame`** and **`dict`** types share many traits.
</p>
</div> 

### Every `Series` is made up of an index and a NumPy array
Now that we know every **`DataFrame`** is filled with **`Series`** objects, let's inspect `name_series` to dig deeper into the data structures.

In [None]:
# Let's ask for the string representation of the object.
# You can ignore the slice notation at the end,
# I just don't want to display all the names.
name_series[0:10]

So, as you can see, we've got two columns here.  
* The first column is the **`index`**.
* The second column, which holds the values of the series, is nothing more than our good friend, the NumPy array.

You can retrieve the index and NumPy array separately from a series as follows:

In [None]:
# Get the Series index
name_series.index

In [None]:
name_series.values

## Going a Bit Deeper
The essential difference between a NumPy **`ndarray`** and a Pandas **`Series`** object is their indexes.

**NumPy arrays have indexes as well, but they are implicit and always integers**. You can't access an array's **`index`** property directly like you can on a Pandas series object as we did above.

Furthermore, series objects are not limited to having integer based indexes. You could have indexes of strings, floats, booleans, dates, etc. 

In [None]:
# Create a `Series` object from a dictionary
# This results in a string based index.

sample_dict = {'R':'Not as cool. :(',
                'Python': 'Best Language Ever!',
                'C': 'Fundamental language',
                'Julia': 'A New language for Data Science'}

simple_series = pd.Series(sample_dict)
simple_series.index

In [None]:
simple_series

In [None]:
simple_series['Python']

In [None]:
simple_series['R':'Python']

When we loaded our `athletes_data` frame from the CSV file, it generated an integer based index, which is the default behavior.

But we could change that.  For instance, we could make the player names the index values:

In [None]:
athletes_data = pd.read_csv('./data/nd-football-2021-roster.csv', index_col = 'Name')

athletes_data.head()

**NOTE**: In the last read_csv, we are using `index_col` keyword argument to provide the column we want to use as an index. This is a common way of loading CSV when we want a specific column to be an index. 

# Data Indexing and Selection

You'll find that many of the same techniques that we used with NumPy arrays are also available for `Series` and `DataFrame` objects. In addition, they offer some functionality that will be very familiar to anyone who has experience with Python dictionaries.

In [None]:
# THE FOLLOWING CODE WILL GIVE YOU AN ERROR
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv')
college_scorecard.head()

### Encoding

Text files are encoded in different formats when they are written. To read them, you must decode them with the same standard or you'll have a problem.
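Here is a small, self-contained sketch of that problem. The two-line CSV is made up for illustration and is held in memory rather than read from the `data` folder; the failure is shown with plain Python decoding, and the successful read uses `pd.read_csv` with the matching encoding.

```python
import io
import pandas as pd

# Hypothetical CSV content, encoded as latin-1 (each "é" becomes the single byte 0xE9)
raw_bytes = 'name,city\nCafé,Montréal\n'.encode('latin-1')

# Decoding with the wrong standard fails
try:
    raw_bytes.decode('utf-8')
except UnicodeDecodeError as err:
    print('utf-8 failed:', err.reason)

# Decoding with the standard the bytes were written in works
frame = pd.read_csv(io.BytesIO(raw_bytes), encoding='latin-1')
print(frame.loc[0, 'city'])    # Montréal
```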

For example, our `college-scorecard-data-scrubbed.csv` file was encoded using `latin-1`, but the default setting for Pandas in Python 3 is `utf-8`, so we get an error if we try to read the file without specifying the correct encoding. Passing the `encoding` argument fixes it:

In [None]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')
college_scorecard.head()

### Setting the appropriate column as an index while loading

In [None]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1', 
    index_col='institution_name')
college_scorecard.head()

**NOTE**: In the last read_csv, we are using `index_col` keyword argument to provide the column we want to use as an index. This is a common way of loading CSV when we want a specific column to be an index. 

## Selecting Data from `Series` Objects

Let's start by grabbing the `url` series object out of our data frame:

In [None]:
url_series = college_scorecard['url']
url_series.head()

As a reminder, a `Series` object is comprised of an explicit index and the values. **Notice here that our `Series` object inherits the 'institution_name' column values as the index from the `DataFrame`.**

### Dictionary Like Features

Several of the methods available on Python **`dict`** objects are also available on `Series` objects. The reason that this is possible is because Pandas maintains a mapping relationship between the explicit index elements and the Series values - just like standard Python does between the keys & values of a dictionary.  
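As a rough illustration of that mapping, the sketch below calls three dictionary-style methods on a small, made-up `Series` (a stand-in for `url_series`, so the labels and values are hypothetical).

```python
import pandas as pd

# Hypothetical Series: index labels act like keys, values like dict values
sites = pd.Series({'NumPy': 'numpy.org', 'pandas': 'pandas.pydata.org'})

# keys() returns the index
print(list(sites.keys()))                  # ['NumPy', 'pandas']

# items() yields (index label, value) pairs, like dict.items()
for name, url in sites.items():
    print(name, '->', url)

# get() returns a default instead of raising when the label is missing
print(sites.get('SciPy', 'not listed'))    # not listed
```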

#### Membership Testing with `in`
You can determine if a given **index** exists in a `Series` using the `in` keyword:


In [None]:
'University of Notre Dame' in url_series

#### Value Retrieval via Index "Key"
You can retrieve a value from the `Series` by passing it the index "key" you are interested in.

In [None]:
url_series['University of Notre Dame']

In [None]:
url_series['Carnegie Mellon University']

### Array Like Features
Now we will explore some of the array like features of `Series` objects. Most of this will be familiar given what you already know about NumPy arrays, so we will move quickly.

#### Slicing with Explicit Indexes & Implicit Indexes
Slicing is pretty straightforward with NumPy arrays because of their implicit integer-based indexes. It gets a little more complicated with `Series` objects because the explicit index isn't necessarily integer based.

Just like a normal slice, you specify the two elements that mark the start and end of what is returned. The difference here is that you can specify the actual index element names/keys instead of numbers.

Here we will ask for all the listings from Stanford to Notre Dame.

In [None]:
url_series['Stanford University':'University of Notre Dame']

<div class="alert alert-block alert-info">
<p>
It is important to note that the reverse request, `url_series['University of Notre Dame': 'Stanford University']` would have yielded no results.
</p>
<p>
This is because 'University of Notre Dame' appears after 'Stanford' in the CSV file. Remember that technically, the first item in a slice notation is the 'start' and the second is the 'end'. It is important that you have them in the right order.
</p>
</div> 

<div class="alert alert-block alert-danger">
<h5>Warning: Important Distinction</h5>
<p>
In a NumPy array slice (or when using an implicit index), the 'end' value of the slice notation is not included in the return slice.
</p>
<p>
Strangely, when using a slice with an explicit index - the end value is included. Be careful about this as you could end up with an extra record in your slices that you don't want.
</p>
</div> 
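
The two notes above can be checked on a tiny, made-up `Series` (the labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical Series with a string index
letters = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = letters['b':'c']      # explicit index: the end label 'c' is included
by_position = letters[1:2]       # implicit index: position 2 is excluded

print(by_label.tolist())         # [20, 30]
print(by_position.tolist())      # [20]

# Reversing the start and end labels gives an empty result
print(len(letters['c':'b']))     # 0
```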

##### The Implicit Index Lurking in the Shadows

While it is true that every `Series` object has an explicit index, there is also an implicit index that is always available. Because of this, you can continue to use "normal" slice notations on `Series` objects with non-integer based explicit indexes.

Here is an example.

In [None]:
# Using "normal" slice notations on our `url_series`
# First ten elements
url_series[0:10]

<div class="alert alert-block alert-danger">
<h5>Important Warning! Implicit vs. Explicit indexing</h5>
<p>
A confusing situation arises when you have a `Series` with an explicit integer index that doesn't start with 0 and increment 1 for each element.
</p>

<p>
Slice notations get convoluted in this case and you have to use the special indexer attributes `.loc` and `.iloc` (**discussed in your textbook on pages 109-110**) to keep things straight. The older `.ix` indexer has been removed from current versions of Pandas.
</p>
</div> 
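
A minimal sketch of that situation, using a made-up `Series` whose explicit index is 1, 3, 5. With `.loc` the numbers are always treated as labels, and with `.iloc` they are always treated as positions.

```python
import pandas as pd

# Hypothetical Series with an explicit integer index that does not start at 0
odd_series = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(odd_series.loc[1])              # a  (the row labeled 1)
print(odd_series.iloc[1])             # b  (the row at position 1)

print(odd_series.loc[1:3].tolist())   # ['a', 'b']  label slice, end included
print(odd_series.iloc[1:3].tolist())  # ['b', 'c']  position slice, end excluded
```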

#### Series Masking
You can do masking on `Series` objects in the same way you did with NumPy arrays. Review the Jupyter Notebook for Sept 19th for more information on masking using NumPy.

Here are a couple of examples:

In [None]:
# Let's get a new Series object with numeric data on SAT average scores.
sat_average_series = college_scorecard['sat_average']

In [None]:
sat_average_series.head()

In [None]:
sat_average_series > 1200

In [None]:
# Return schools with SAT averages over 1200
sat_average_series[sat_average_series > 1200]

## Activity:

1. What schools have averages between 1400 & 1500?
1. Is University of Notre Dame one of the schools? 
1. How about 'Harvard University'? 


## Selecting Data from `DataFrame` Objects

Similarly to what we found with `Series` objects, you can interact with `DataFrame` objects in ways that sometimes resemble a dictionary and other times a NumPy array.

### Dictionary Like Features


In [None]:
# You can retrieve an individual Series from a DataFrame
# by passing the Series name/key to the DataFrame
college_scorecard['religious_affiliation_desc']

In [None]:
# Test for the existence of a given column/Series in a DataFrame
'city' in college_scorecard

<div class="alert alert-block alert-warning">
<p> Note the distinction between the `in` operator on a `Series` and on a `DataFrame`. On a `Series`, it checks whether the value is present in the index. On a `DataFrame`, it checks whether the value is present in the columns.</p>
</div>
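
The difference can be seen on a small, made-up frame (the school names and values are hypothetical):

```python
import pandas as pd

# Hypothetical frame indexed by school name
schools = pd.DataFrame(
    {'state': ['IN', 'PA'], 'city': ['Notre Dame', 'Pittsburgh']},
    index=['School A', 'School B'])

print('city' in schools)                  # True: 'city' is a column
print('School A' in schools)              # False: index labels are not checked

print('School A' in schools['city'])      # True: a Series checks its index
print('Pittsburgh' in schools['city'])    # False: the values are not checked

# To test the values themselves, look in the underlying array
print('Pittsburgh' in schools['city'].values)   # True
```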

### Array Like Features

#### Slicing (Explicit Index)
Slicing affects rows, not columns, in a `DataFrame`. In other words, you can slice based on the index values, but not the column values. Let's get a slice of all rows from 'Alaska Bible College' to 'Alabama State University':

In [None]:
college_scorecard['Alaska Bible College': 'Alabama State University']

<div class="alert alert-block alert-info">
<p>
You can, however, use the `iloc` and `loc` indexers to slice based on columns.  **You can look into this on pages 113-114 of your textbook if you are interested.**</p>
</div> 
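
As a brief sketch of what that looks like, here is column slicing on a small, made-up frame (the names `grid`, `row1`, and so on are hypothetical). The part before the comma selects rows and the part after it selects columns.

```python
import pandas as pd

# Hypothetical frame with three columns
grid = pd.DataFrame(
    {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]},
    index=['row1', 'row2'])

by_label = grid.loc[:, 'a':'b']     # label slice on columns, 'b' is included
by_position = grid.iloc[:, 0:2]     # position slice on columns, 2 is excluded

print(list(by_label.columns))       # ['a', 'b']
print(list(by_position.columns))    # ['a', 'b']

# One row, a slice of columns
print(grid.loc['row1', 'b':'c'].tolist())   # [3, 5]
```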

#### Slicing (implicit index)
You can also rely on the implicit integer index of the `DataFrame` (yes, it has one too) to retrieve rows by the numeric index.

**Just remember, the 'end' value of the slice is not included when using the implicit index.**

In [None]:
college_scorecard[0:5]

#### Masking

Masking operations likewise return rows from a `DataFrame`, but the **criteria of the masks will be a comparison on one of the columns/Series**. This sounds somewhat confusing, so let's just demonstrate:

In [None]:
# Return all rows where the 'state' Series has a value of 'AK'
college_scorecard[  college_scorecard['state'] == 'AK'  ]

In [None]:
# Which colleges in IN offer Bachelors degrees?
# Notice the parentheses around each comparison here
college_scorecard[ (college_scorecard['state'] == 'IN') & (college_scorecard['predominant_degree_desc'] == 'Bachelors') ]

### Selecting Multiple Columns of DataFrame

In [None]:
two_columns = college_scorecard[ ['state', 'predominant_degree_desc'] ]
two_columns.head()

**NOTE**: Of the two sets of square brackets `[[ ]]`, the outer set is the usual selection operator on the `DataFrame`, and the inner set is a Python list of the column names you want to select.
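
The sketch below, on a small made-up frame, shows what each form returns: a single string gives back a `Series`, while a list of names (even a list with one name) gives back a `DataFrame`.

```python
import pandas as pd

# Hypothetical frame used only to compare the return types
frame = pd.DataFrame({
    'state': ['IN', 'AK'],
    'city': ['Gary', 'Nome'],
    'population': [1, 2]})

one = frame['state']               # a string key returns a Series
also_one = frame[['state']]        # a one-item list returns a DataFrame
two = frame[['state', 'city']]     # a list of names returns a DataFrame

print(type(one).__name__)          # Series
print(type(also_one).__name__)     # DataFrame
print(two.shape)                   # (2, 2)
```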

## Activity On Football Athletes Data

1. Details of the players who are in the freshman class?
1. Details of the players whose position is defensive lineman (DL) and who are in the senior class?
1. Average height of the players whose position is defensive lineman (DL) and who are in the senior class?

In [None]:
athletes_data = pd.read_csv('./data/nd-football-2021-roster.csv', index_col='Name')
athletes_data.head()

## UFunc Arithmetic with Index Preservation

Let us convert the height of the players into meters. The math to convert from inches to meters is to multiply by 0.0254. 

In [None]:
athletes_data['Height']*0.0254


Do you see how my index was still preserved? This is referred to as **index preservation** and we will see it come into play for both `Series` and `DataFrame` objects when we use arithmetic functions on them.
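
Here is the same idea on a tiny, made-up `Series` (the player labels and heights are hypothetical), including a NumPy ufunc to show that it keeps the index as well.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in inches, indexed by made-up player labels
heights_in = pd.Series([70, 75], index=['Player A', 'Player B'])

heights_m = heights_in * 0.0254    # arithmetic keeps the index
roots = np.sqrt(heights_in)        # so does a NumPy ufunc

print(list(heights_m.index))                   # ['Player A', 'Player B']
print(heights_m.index.equals(roots.index))     # True
```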

## Binary Functions and `DataFrame` Objects
Now let's try performing binary UFunc operations on DataFrames.

#### Operations between 2 DataFrames

To demonstrate how arithmetic operations work between two different `DataFrame` objects I'll need to construct a couple of simple objects.

I'll go ahead and create two imaginary objects that hold sales data over two different years for the burger joint: **In-N-Out**

In [None]:
import pandas as pd

# 2015 Sales DataFrame
sales_2015 = pd.DataFrame([
        {'Burgers': 9574265, 'Fries': 7124736, 'Drinks': 11563762},
        {'Burgers': 6574265, 'Fries': 5124736, 'Drinks': 13563762},
    ], 
    index=['California', 'Texas'])

# 2016 Sales DataFrame
# They open their first Indiana store at Notre Dame!!!
# And they sell Irish Shakes nationwide to celebrate.
sales_2016 = pd.DataFrame([
        {'Burgers': 9742652, 'Fries': 7354736, 'Drinks': 11133762, 'Irish Shakes': 75812},
        {'Burgers': 7774222, 'Fries': 6214736, 'Drinks': 14563762, 'Irish Shakes': 15525},
        {'Burgers': 74265, 'Fries': 54736, 'Drinks': 43762, 'Irish Shakes': 23612},
    ], 
    index=['California', 'Texas', 'Indiana'])


Here's what those `DataFrames` look like separately:

In [None]:
print(sales_2015, sales_2016, sep='\n\n\n')

In [None]:
sales_2015 + sales_2016

To have a value in the results of an operation between two `DataFrame` objects, there must be a value in both of the objects for a given Index/Column combination.

This is why the Indiana row (that index only existed in 2016) and the Irish Shakes column (that column only existed in 2016) appear in our results but contain only `NaN`.
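
If it is acceptable to treat a missing entry as zero, one option is the `add` method with its `fill_value` argument instead of the `+` operator. The sketch below uses much smaller, made-up tables (`y1` and `y2` are hypothetical names), and whether zero is the right stand-in depends on your data.

```python
import pandas as pd

# Hypothetical, much smaller sales tables
y1 = pd.DataFrame({'Burgers': [10, 20]}, index=['California', 'Texas'])
y2 = pd.DataFrame({'Burgers': [1, 2, 3], 'Shakes': [4, 5, 6]},
                  index=['California', 'Texas', 'Indiana'])

plain = y1 + y2                       # missing in either frame gives NaN
filled = y1.add(y2, fill_value=0)     # missing in one frame counts as 0

print(plain.isna().sum().sum())            # 4
print(filled.isna().sum().sum())           # 0
print(filled.loc['Indiana', 'Burgers'])    # 3.0
```

A cell that is missing from both frames would still come out as `NaN`, because there is nothing to add.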

## Loading JSON Files
In terms of web APIs, JSON is the dominant data transmission format on the internet right now, so you'll need to be familiar with how to load it into **`DataFrame`** objects as well.

There are a wide variety of ways that JSON documents can be structured, but there are really only a few formats that Pandas will read without problems.

For our purposes, we'll use a pretty basic file that conforms to one of the standard formats just to get our feet wet.
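To give a feel for one of those standard formats, here is a sketch of the `records` layout: a JSON list in which each object becomes one row. The JSON text is made up and read from memory, so it says nothing about the contents of the file used in this lecture.

```python
import io
import pandas as pd

# Hypothetical records-oriented JSON: a list of objects, one per row
json_text = '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]'

items = pd.read_json(io.StringIO(json_text), orient='records')
print(list(items.columns))     # ['id', 'name']
print(items.shape)             # (2, 2)
```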

I've uploaded a JSON formatted file `pokedex.json` for us to use.  Hopefully, you are a Pokemon fan.

In [None]:
# We use the `orient` parameter to tell Pandas what the basic 
# structure of the JSON is.  The other options are:
# split, index, columns, values, and table
# More Info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pokedex = pd.read_json('./data/pokedex.json', orient='records')
pokedex.head()

## Practice Dictionary Activity

1. Accept a string as an input from the user
2. Create a dictionary that contains the frequency of each word in the string.
3. Print the dictionary

Below is the sample interaction

In [None]:
# Accept a sentence
sent = input("Enter a sentence: ")

# Convert the sentence to a list of words using split() function
words_list = sent.split()

# Create an empty dictionary that contains words and frequencies


# Iterate through every word in the list and create the word-count dictionary

# Finally print the dictionary
