-------
#Pandas DataFrame Data Structure
-------

The **DataFrame** data structure is the heart of the pandas library. It's the primary object that you'll be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional Series object, with an index and multiple columns of content, where each column has a label. In fact, the distinction between a column and a row is really only a conceptual one, and you can think of the DataFrame itself as simply a two-axis labeled array.

##Creating DataFrames

In [69]:
# Let's start by importing our pandas library
import pandas as pd

Let's start with an example. Let's create three school records for students and their class grades. I'll create each as a series which has a student name, the class name, and the score. 

In [None]:
record1 = pd.Series({'Name': 'Alice',
                     'Class': 'Physics',
                     'Score': 85})
record2 = pd.Series({'Name': 'Jack',
                     'Class': 'Chemistry',
                     'Score': 82})
record3 = pd.Series({'Name': 'Helen',
                     'Class': 'Biology',
                     'Score': 90})

Like a Series, the DataFrame object is also indexed. Here we'll use a group of Series, where each Series represents a row of data. Just like with the Series constructor, we can pass in our individual items as a list, and we can pass in our index values as a second argument.

And just like the Series, we can use the `head()` function to see the first several rows of the DataFrame, including the labels from both axes, and we can use this to verify the columns and the rows.

In [None]:
df = pd.DataFrame([record1, record2, record3],
                  index=['school1', 'school2', 'school1'])

df.head()

You'll notice here that Jupyter creates a nice bit of HTML to render the results of the DataFrame. So we have the index, which is the leftmost column and is the school name, and then we have the rows of data, where each value sits under a column header that was given in our initial record dictionaries.

An alternative method is that you could use a list of dictionaries, where each dictionary represents a row of data. We then pass this list of dictionaries into the `pd.DataFrame()` function.

In [None]:
students = [{'Name': 'Alice',
             'Class': 'Physics',
             'Score': 85},
            {'Name': 'Jack',
             'Class': 'Chemistry',
             'Score': 82},
            {'Name': 'Helen',
             'Class': 'Biology',
             'Score': 90}]

df = pd.DataFrame(students, index=['school1', 'school2', 'school1'])
df.head()
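As a quick check on the two construction routes above, the following sketch builds the same table both ways and compares the contents. The variable names `df_from_series` and `df_from_dicts` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

labels = ['school1', 'school2', 'school1']
rows = [{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
        {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
        {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}]

# Route 1: one Series per row
df_from_series = pd.DataFrame([pd.Series(r) for r in rows], index=labels)
# Route 2: the list of dictionaries passed in directly
df_from_dicts = pd.DataFrame(rows, index=labels)

# Compare the labels on each axis and the cell values
print(list(df_from_series.columns) == list(df_from_dicts.columns))
print(list(df_from_series.index) == list(df_from_dicts.index))
print(df_from_series.values.tolist() == df_from_dicts.values.tolist())
```

The comparison is done on labels and values rather than on dtypes, since the two routes are only claimed here to hold the same data.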

##Accessing Data in DataFrames

Similar to the Series, we can extract data using the `.iloc` and `.loc` attributes. Because the DataFrame is two-dimensional, passing a single value to the `.loc` indexing operator will return a Series if there's only one row to return.

For instance, if we wanted to select data associated with school2, we would just query the `.loc` attribute with one parameter.

In [None]:
df.loc['school2']

You'll note that the name of the returned Series is the row index value, school2, while the column names appear in the output as the index of that Series.

We can check the data type of the return using the Python `type` function.

In [None]:
type(df.loc['school2'])

It's important to remember that the indices and column names along either axis, horizontal or vertical, could be *non-unique*.

In this example, we see two records for school1 as different rows. If we use a single value with the DataFrame `.loc` attribute, multiple rows of the DataFrame will be returned, not as a new Series, but as a new DataFrame.

In [None]:
df.loc['school1']

And we can see the type is now different too...

In [None]:
type(df.loc['school1'])
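The return-type behavior described above can be sketched on a small self-contained frame. The last line shows one way to get a DataFrame back regardless of how many rows match, by passing the label inside a list; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'Name': 'Alice', 'Score': 85},
                   {'Name': 'Jack', 'Score': 82},
                   {'Name': 'Helen', 'Score': 90}],
                  index=['school1', 'school2', 'school1'])

unique_row = df.loc['school2']       # label occurs once  -> Series
repeated_rows = df.loc['school1']    # label occurs twice -> DataFrame
always_frame = df.loc[['school2']]   # list of labels     -> DataFrame

print(type(unique_row).__name__)
print(type(repeated_rows).__name__)
print(type(always_frame).__name__)
```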

One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes.

For instance, if you wanted to just list the student names for school1, you would supply two parameters to `.loc`, one being the row index and the other being the column name.

In [None]:
df.loc['school1', 'Name']

Remember, just like the Series, the pandas developers have implemented this using the indexing operator and not as parameters to a function.

What would we do if we just wanted to select a single column though? Well, there are a few mechanisms. Firstly, we could transpose the matrix. This pivots all of the rows into columns and all of the columns into rows, and is done with the `T` attribute.

In [None]:
df.T

Then we can call `.loc` on the transpose to get the student names only

In [None]:
df.T.loc['Name']

However, since `iloc` and `loc` are used for row selection, Pandas reserves the indexing operator directly on the DataFrame for column selection. In a Pandas DataFrame, columns always have a name. So this selection is always label based, and is not as confusing as it was when using the square bracket operator on the series objects. For those familiar with relational databases, this operator is analogous to column projection.

In [None]:
df['Name']

In practice, this works really well since you're often trying to add or drop columns. However, this also means that you get a `KeyError` if you try to use `.loc` with a column name as its only argument.

In [None]:
df.loc['Name']

Note too that the result of a single column projection is a Series object

In [None]:
type(df['Name'])

Since the result of using the indexing operator is either a DataFrame or Series, you can chain operations together. For instance, we can select all of the rows which relate to school1 using `.loc`, then project the name column from just those rows.

In [None]:
df.loc['school1']['Name']

If you get confused, use type to check the responses from resulting operations

In [None]:
print(type(df.loc['school1'])) #should be a DataFrame
print(type(df.loc['school1']['Name'])) #should be a Series

*Chaining*, i.e. indexing on the return type of another index, can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame.

For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data though, this is an important distinction and can be a source of error.
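To make the point about changing data concrete, here is a minimal sketch of the non-chained form for an assignment. It assumes a small stand-in frame, and the replacement value `100` is arbitrary.

```python
import pandas as pd

df = pd.DataFrame([{'Name': 'Alice', 'Score': 85},
                   {'Name': 'Jack', 'Score': 82},
                   {'Name': 'Helen', 'Score': 90}],
                  index=['school1', 'school2', 'school1'])

# Chained form (best avoided for assignment): df.loc['school1']['Score'] = 100
# The first index may hand back a copy, so the write can miss the original.

# Single .loc call: row label and column label in one indexing operation,
# so the assignment is applied to df itself.
df.loc['school1', 'Score'] = 100
print(df['Score'].tolist())
```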

Here's another approach. As we saw, `.loc` does row selection, and it can take two parameters, the row index and the list of column names. The `.loc` attribute also supports slicing.

If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end. This is just like slicing elements of a list in Python. Then we can add the column name as the second parameter as a string. If we wanted to include multiple columns, we could pass them as a list, and pandas will bring back only the columns we have asked for.

Here's an example, where we ask for all the names and scores for all schools using the `.loc` operator.

In [None]:
df.loc[:,['Name', 'Score']]

Take a look at that again. The colon means that we want to get all of the rows, and the list in the second argument position is the list of columns we want to get back.

Another shortcut to get the same results, if you want all rows, is to drop the `.loc` attribute and just index with the list of columns.

In [None]:
df[['Name','Score']]
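A small sketch, on a stand-in frame, confirming that the two spellings above select the same data, and that a single column name given as a string (rather than a list) yields a Series:

```python
import pandas as pd

df = pd.DataFrame([{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
                   {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
                   {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
                  index=['school1', 'school2', 'school1'])

via_loc = df.loc[:, ['Name', 'Score']]   # all rows, two columns
via_brackets = df[['Name', 'Score']]     # shorthand for the same projection
print(via_loc.equals(via_brackets))

# A string instead of a list selects one column and returns a Series
print(type(df.loc[:, 'Name']).__name__)
```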

That's selecting and projecting data from a DataFrame based on row and column labels. The key concepts to remember are that the rows and columns are really just for our benefit. Underneath this is just a two-axis labeled array, and transposing it is easy. Also, consider the issue of chaining carefully, and try to avoid it, as it can cause unpredictable results, where your intent was to obtain a view of the data, but instead pandas returns to you a copy.

The figure below provides a quick summary of accessing data in a DataFrame:

![dataframe.png](https://drive.google.com/uc?id=1t83tb2TSBEojwRU6MWC-Xzp_7AcdXZFj)

##Dropping Data in DataFrames

Let's talk about dropping data. It's easy to delete data in Series and DataFrames, and we can use the `.drop()` function to do so. This function takes a single required parameter, the label to drop, which by default is matched against the row index.

Note though that the drop function doesn't change the DataFrame by default! Instead, the `.drop()` function returns a copy of the DataFrame with the given rows removed.

In [None]:
df.drop('school1')

But if we look at our original DataFrame we see the data is still intact.

In [None]:
df

Drop has two interesting optional parameters. The first is called `inplace`; when set to `True`, the DataFrame will be updated in place and no copy will be returned. The second parameter is `axis`, where we identify which axis the label should be dropped from. By default, this value is `0`, indicating the row axis. In order to drop a column, you need to change it to `1`.

In [None]:
# Now let's drop the Name column
df.drop("Name", inplace=True, axis=1)
df
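As an alternative to `inplace=True`, a common pattern is to keep the returned copy under a new name. The sketch below uses a stand-in frame, and also shows the `columns=` keyword, which is another way of asking for a column drop without passing `axis=1`.

```python
import pandas as pd

df = pd.DataFrame([{'Name': 'Alice', 'Class': 'Physics', 'Score': 85},
                   {'Name': 'Jack', 'Class': 'Chemistry', 'Score': 82},
                   {'Name': 'Helen', 'Class': 'Biology', 'Score': 90}],
                  index=['school1', 'school2', 'school1'])

# drop() returns a new DataFrame; the original is left untouched
without_name = df.drop('Name', axis=1)
# The columns= keyword expresses the same request
also_without_name = df.drop(columns='Name')

print(list(df.columns))
print(list(without_name.columns))
print(without_name.equals(also_without_name))
```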

There is a second way to drop a column, and that's directly through the use of the indexing operator and the `del` keyword. This way of dropping data, however, takes immediate effect on the DataFrame and does not return a copy.

In [None]:
del df['Class']
df

Finally, adding a new column to the DataFrame is as easy as assigning it to some value using the indexing operator. For instance, if we wanted to add a class ranking column with default value of `None`, we could do so by using the assignment operator after the square brackets. This broadcasts the default value to the new column immediately.

In [None]:
df['ClassRanking'] = None
df
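The broadcasting of a single value can be contrasted with assigning one value per row. In this sketch the `Rank` column and its values are made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame([{'Score': 85}, {'Score': 82}, {'Score': 90}],
                  index=['school1', 'school2', 'school1'])

# A single value is broadcast to every row of the new column
df['ClassRanking'] = None
# A list supplies one value per row, so its length must match the number of rows
df['Rank'] = [2, 3, 1]   # hypothetical values

print(df['ClassRanking'].isna().all())
print(df['Rank'].tolist())
```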

##DataFrame Indexing and Loading

Throughout the course, we'll be largely using smaller or moderate-sized datasets. A common workflow is to read the dataset, usually from some external file, then begin to clean and manipulate the dataset for analysis. So, let's demonstrate how you can load data from a comma separated file directly into a DataFrame.

Let's talk about comma separated values (CSV) files. You've undoubtedly used these: any spreadsheet software like Excel or Google Sheets can save output in CSV format. It's pretty loose as a format, incredibly lightweight, and totally ubiquitous.

Pandas makes it easy to turn a CSV into a DataFrame by just calling the `pd.read_csv()` function.

In [None]:
# Again, to access the file, you need to mount the drive.
# If you are running Jupyter Notebook locally, there is no need to do this step.
from google.colab import drive
drive.mount('/content/drive')
# A line that starts with "!" is passed to the system shell and run as a command
!ls /content/drive/My\ Drive/Applied\ Data\ Science\ in\ Python/datasets/

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/My Drive/Applied Data Science in Python/datasets/Admission_Predict.csv')

# And let's look at the first 5 rows
df.head()

We see from the output that there is a list of columns, and the column identifiers are listed as strings on the first line of the file. Then we have rows of data, where the column values are separated by commas in the original file.

So now we have all the values in the file organized in a "table-like" format, aka the DataFrame. Notice, however, that by default the index starts with 0 while the students' serial number starts from 1. If you take a look at the CSV file, you'll see that this index isn't there, and you can deduce that pandas has created its own new index. So, we can either keep it as it is or, if we want to, set the serial number as the index by using the `index_col` parameter.

In [None]:
df = pd.read_csv('/content/drive/My Drive/Applied Data Science in Python/datasets/Admission_Predict.csv', index_col="Serial No.")
df.head()
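If the dataset file is not at hand, the effect of `index_col` can be tried on an in-memory CSV. This is a sketch only: the rows below are made up, and just the "Serial No." header is borrowed from the lecture's file.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file
csv_text = "Serial No.,GRE Score,CGPA\n1,330,9.1\n2,315,8.2\n3,322,8.7\n"

default_index = pd.read_csv(io.StringIO(csv_text))
serial_index = pd.read_csv(io.StringIO(csv_text), index_col='Serial No.')

print(list(default_index.index))   # positions generated by pandas
print(list(serial_index.index))    # values taken from the "Serial No." column
print(list(serial_index.columns))  # "Serial No." is no longer a data column
```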

Notice also that we have two columns, "SOP" and "LOR", whose meaning is probably not obvious to many readers. So let's change our column names to make them clearer. In pandas, we can use the `.rename()` function, which takes a parameter called `columns` that accepts a dictionary where the keys are the old column names and the values are the corresponding new column names.

In [None]:
new_df=df.rename(columns={'SOP': 'Statement of Purpose','LOR': 'Letter of Recommendation'})
new_df.head()

From the output, we can see that only "SOP" is changed but not "LOR". Why is that? 

Let's investigate this a bit. First we need to make sure we got all the column names correct. We can use the `.columns` attribute of DataFrame to get a list.

In [None]:
new_df.columns

If we look at the output closely, we can see that there is actually a space right after "LOR" and a space right after "Chance of Admit". Sneaky, huh? So this is why our rename dictionary did not work for "LOR", because the key we used was just three characters, instead of "LOR ".

There are a couple of ways we could address this. One way would be to change the column name by including the space in the name.

In [None]:
new_df=new_df.rename(columns={'LOR ': 'Letter of Recommendation'})
new_df.head()

So that works well, but it's a bit fragile. What if that was a tab instead of a space? Or two spaces?

Another way is to create some function that does the cleaning and then tell `.rename()` to apply that function across all of the labels. Python comes with a handy string function, called `strip()`, that strips leading and trailing white space. We can pass this function in to rename via the `mapper` parameter, which applies the function to each label of the axis indicated. Finally, we indicate whether the axis should be `"columns"` or `"index"` (row labels).

In [None]:
new_df=new_df.rename(mapper=str.strip, axis="columns")
# Let's take a look at results
print(new_df.columns)
new_df.head()
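The robustness argument can be checked on a hypothetical frame whose labels carry different kinds of stray whitespace (two spaces and a tab here, both invented for the example):

```python
import pandas as pd

# Hypothetical labels with trailing whitespace of different kinds
df = pd.DataFrame([[4.5, 4.0, 0.9]],
                  columns=['SOP', 'LOR  ', 'Chance of Admit\t'])

# str.strip is applied to every column label
cleaned = df.rename(mapper=str.strip, axis='columns')

print(list(cleaned.columns))
print(list(df.columns))   # the original labels are unchanged
```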

Now we've got it - both "SOP" and "LOR" have been renamed and "Chance of Admit" has been trimmed up. 

Remember though that the rename function isn't modifying the DataFrame. In this case, df is the same as it always was, there's just a copy in `new_df` with the changed names.

In [None]:
df.columns

We can also assign a list of column names to the `df.columns` attribute, which renames the columns directly. This modifies the original DataFrame and is very efficient, especially when you have a lot of columns and you only want to change a few. This technique is also not affected by subtle errors in the column names, a problem that we just encountered, because it simply overwrites the existing names at the positions specified. With a list, you can use the list index to change a certain value or use a list comprehension to change all of the values.

In [None]:
# As an example, let's change all of the column names to lower case. First we need to get our list
cols = list(df.columns)
# Then a little list comprehension
cols = [x.lower().strip() for x in cols]
# Change an individual column name by its position in the list
cols[3] = "statement of purpose"
# Then we just overwrite what is already in the .columns attribute
df.columns = cols
# And take a look at our results
df.head()
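As a side note, the same lower-casing and stripping can also be written without a list comprehension, using the string methods that pandas exposes on the column Index through `.str`. This is an alternative sketch on a made-up frame, not the approach used in the cell above.

```python
import pandas as pd

# Made-up frame; note the trailing space in 'LOR '
df = pd.DataFrame([[330, 4.5, 4.0]], columns=['GRE Score', 'SOP', 'LOR '])

# Index objects offer vectorized string methods through the .str accessor
df.columns = df.columns.str.strip().str.lower()
print(list(df.columns))
```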

So far we've learned how to import a CSV file into a pandas DataFrame object, and how to do some basic data cleaning to the column names. The CSV file import mechanisms in pandas have [lots of different options](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), and you really need to learn these in order to be proficient at data manipulation. Once you have set up the format and shape of a DataFrame, you have a solid start to further actions such as conducting data analysis and modeling.

Now, there are other data sources you can load directly into dataframes as well, including HTML web pages, databases, and other file formats. However, the CSV file format is by far the most common data format you'll run into, and an important one to know how to manipulate in pandas.

##Querying DataFrames using Boolean Masks

Let's talk about querying DataFrames. The first step in the process is to understand **Boolean masking**. Boolean masking is the heart of fast and efficient querying in NumPy and pandas, and it's analogous to bit masking used in other areas of computer science.

A boolean mask is an array which can be of one dimension like a Series, or two dimensions like a DataFrame, where each of the values in the array are either `True` or `False`. This array is essentially overlaid on top of the data structure that we're querying. Any cell aligned with the `True` value will be admitted into our final result, and any cell aligned with a `False` value will not.

Here's an illustration of how a boolean mask would work in a DataFrame:

![boolean mask](https://drive.google.com/uc?id=1Of1IXHiESQssq6ZQTFa_sYn5brDuyjz4)

Let's look at our graduate admission dataset again.

In [None]:
df.head()

Boolean masks are created by applying operators directly to the pandas Series or DataFrame objects. For instance, in our graduate admission dataset, we might be interested in seeing only those students that have a chance of admission higher than 0.7.

To build a Boolean mask for this query, we want to project the "chance of admit" column using the indexing operator and apply the greater than operator with a comparison value of 0.7. This is essentially broadcasting a comparison operator, greater than, with the results being returned as a Boolean Series. The resultant Series keeps the original index, and the value of each cell is either `True` or `False` depending on whether a student has a chance of admit higher than 0.7.

In [None]:
admit_mask=df['chance of admit'] > 0.7
admit_mask.head()

This is pretty fundamental, so take a moment to look at this. The result of broadcasting a comparison operator is a Boolean mask, i.e. `True` or `False` values depending upon the results of the comparison. Underneath, pandas is applying the comparison operator you specified through vectorization (so efficiently, without an explicit Python loop) to all of the values in the array you specified which, in this case, is the "chance of admit" column. The result is a Series, since we are only applying the operator on one column.

So, what do you do with the boolean mask once you have formed it? Well, you can just lay it on top of the data to "hide" the data you don't want, which is represented by all of the `False` values. This can be done using the `.where()` function on the original DataFrame.

In [None]:
df.where(admit_mask).head()

We see that the resulting DataFrame keeps the original indexed values, and only data which met the condition was retained. All of the rows which did not meet the condition have `NaN` data instead, but these rows were not dropped from our dataset. 

Now, if we don't want the `NaN` data, the next step would be to use the `dropna()` function

In [None]:
df.where(admit_mask).dropna().head()

The returned DataFrame now has all of the `NaN` rows dropped. Notice the index now does not include 5.

Despite being really handy, `where()` isn't actually used that often. Instead, the pandas developers created a shorthand syntax which combines `where()` and `dropna()` at once. And, in typical fashion, they just overloaded the indexing operator to do this!

In [None]:
df[df['chance of admit'] > 0.7].head()
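The relationship between the two-step and one-step forms can be sketched on a tiny frame with made-up values. One difference worth noticing is that `where()` fills masked cells with `NaN`, which turns integer columns into floats, while direct Boolean indexing keeps the original dtypes.

```python
import pandas as pd

# Made-up values; column names follow the lower-cased names used above
df = pd.DataFrame({'gre score': [337, 316, 314],
                   'chance of admit': [0.92, 0.72, 0.65]},
                  index=pd.Index([1, 2, 3], name='serial no.'))

mask = df['chance of admit'] > 0.7
masked = df.where(mask)       # same shape, NaN in rows where mask is False
two_step = masked.dropna()    # rows containing NaN removed
one_step = df[mask]           # Boolean indexing keeps only the True rows

print(mask.tolist())
print(list(two_step.index) == list(one_step.index))
print(masked['gre score'].dtype, one_step['gre score'].dtype)
```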

While I personally find this much harder to read, it's much more commonly used and you're more likely to see it in other people's code, so it's important to be able to understand it. 

You can use the indexing operator on a DataFrame in three ways:

1. It can be called with a string parameter to project a single column


In [None]:
df["gre score"].head()

2. You can send it a list of columns as strings


In [None]:
df[["gre score","toefl score"]].head()

3. You can send it a boolean mask


In [None]:
df[df["gre score"]>320].head()

Each of these is mimicking functionality from either `.loc[]` or `.where().dropna()`.

###Combining Multiple Boolean Masks

Let's look at combining multiple boolean masks. What if we want to satisfy multiple criteria in our selection?

This is similar to bit masking in computer science, where in order to satisfy multiple criteria, we use the "and" operator to extract the rows where all the conditions are satisfied. In the case where at least one of several conditions needs to be satisfied, the "or" operator is used.

The truth table below summarizes how the "and" and "or" operators operate:

| A | B | A AND B | A OR B |
|---|---|---------|--------|
|False|False|False|False|
|False|True|False|True|
|True|False|False|True|
|True|True|True|True|


Unfortunately, it doesn't feel quite as natural in pandas. For instance, if you want to take two boolean series and "and" them together...

In [None]:
(df['chance of admit'] > 0.7) and (df['chance of admit'] < 0.9)

... we get an error. And despite using pandas for a while, I still find I regularly try and do this. The problem is that Python's `and` and `or` need a single truth value for each operand, and a Series holding many values doesn't have one. As a solution, the pandas authors have overloaded the ampersand `&` (instead of "and") and pipe `|` (instead of "or") operators to handle this element by element.

In [None]:
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)
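To connect the overloaded operators back to the truth table, here is a minimal sketch on a made-up Series of three values:

```python
import pandas as pd

chance = pd.Series([0.65, 0.80, 0.95], name='chance of admit')  # made-up values

above = chance > 0.7   # [False, True, True]
below = chance < 0.9   # [True, True, False]

both = above & below     # element-wise AND
either = above | below   # element-wise OR

print(both.tolist())
print(either.tolist())
```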

A common error for new pandas users is to try and do boolean comparisons using the `&` operator without putting parentheses around the individual terms you are interested in.

In [None]:
df['chance of admit'] > 0.7 & df['chance of admit'] < 0.9

The problem is operator precedence: `&` binds more tightly than the comparison operators, so Python is trying to "bitwise and" the `0.7` and a pandas Series, when you really want to "bitwise and" the two Boolean Series produced by the comparisons.
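The effect of the missing parentheses can be reproduced safely by catching the error. This sketch uses a made-up Series and catches both `TypeError` and `ValueError`, since the exact exception raised is treated here as an assumption about the installed pandas version.

```python
import pandas as pd

chance = pd.Series([0.65, 0.80, 0.95])  # made-up values

# Without parentheses, & is evaluated before > and <, so Python first
# attempts 0.7 & chance, which is not a meaningful operation and raises.
try:
    chance > 0.7 & chance < 0.9
    raised = False
except (TypeError, ValueError):
    raised = True

# With parentheses, each comparison is evaluated first, then combined
mask = (chance > 0.7) & (chance < 0.9)

print(raised)
print(mask.tolist())
```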


Another way to do this is to just get rid of the comparison operator completely, and instead use the built in functions which mimic this approach.

In [None]:
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

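These functions are built right into the Series and DataFrame objects. That means you can chain method calls too, but be careful: each call operates on the result of the previous one. Chaining `.lt(0.9)` after `.gt(0.7)` compares the Boolean results of the first comparison (where `True` and `False` behave like 1 and 0) against 0.9, not the original values, so it does not produce the same answer as the `&` expression above. To test a range, combine the two comparisons with `&` as shown. Whether you write the individual comparisons with operators or with the method forms is a matter of taste.

In [None]:
# Caution: this is NOT equivalent to the & expression above, because .lt(0.9)
# is applied to the Boolean output of .gt(0.7) rather than to the scores
df['chance of admit'].gt(0.7).lt(0.9)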

In this lecture, we have learned to query a DataFrame using boolean masking, which is extremely important and often used in the world of data science. With boolean masking, we can select data based on the criteria we desire and, frankly, you'll use it everywhere. We've also seen how there are many different ways to query the DataFrame, and the interesting side implications that come up when doing so.