# A brief introduction to pandas


Pandas is a powerful library for storing and manipulating data. Most of the time, it reads data in from a file (usually CSV, although it supports many other formats such as XLSX and JSON) and converts it into a DataFrame, which is essentially a table, so that you can explore, summarize and clean your data. Let's jump right in and see exactly what it can do!

For this tutorial, we are going to be using the [Used Cars Dataset](https://www.kaggle.com/austinreese/craigslist-carstrucks-data). 


In your jupyter notebook / python file, make sure to import the following: 

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
```

The other two libraries are there for additional utility that we might need. Matplotlib is a plotting library for creating graphs, and NumPy is a numerical library for scientific computing in Python.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

FILE_PATH = "../input/craigslist-carstrucks-data/vehicles.csv" 

In [None]:
plt.rcParams["figure.figsize"] = (13, 7)

## Reading the file

First things first, we need to be able to load our data in. This is where the `read_csv` function comes in handy. It takes in your file path, and it returns a dataframe ready for us to use. There are other read functions available, such as `read_excel` that suit many different file types.

Optionally, there are many different parameters that you can pass in that might be useful. One common parameter is `nrows`, which allows you to specify how many rows of the csv you want to read in. This is useful when the dataset is too large, and we only need a portion of it. 

In the example, our csv is quite large, so we are going to utilize `nrows` to tell pandas to only take the first 10,000 rows. 

```python
df = pd.read_csv(FILE_PATH, nrows=10000)
```

Make sure to set `FILE_PATH` to the actual path to the csv file! 


With that, we have our dataframe in our `df` object, and we're ready to explore our data.

In [None]:
df = pd.read_csv(FILE_PATH, nrows=10000)
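
To see what `nrows` does without needing the real file, here is a minimal sketch that reads from a small in-memory CSV. The `csv_text` string and its values are made up for illustration and are not part of the Used Cars Dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for vehicles.csv: a tiny CSV held in memory.
csv_text = "id,price,year\n1,5000,2008\n2,12000,2015\n3,800,1999\n4,23000,2019\n"

# nrows=2 stops after the first two data rows (the header line is not counted).
toy = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(toy.shape)
```

The resulting frame has two rows and all three columns, even though the source contains four data rows.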

## Quick look into the data

Before we work with our data, we first have to see what our data looks like! We can do that with `df.head()` to get the first few rows. By default, it returns the first 5 rows. We can change this by passing in an integer. For example, `df.head(10)` returns the first 10 rows.


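Alternatively, we can use `df.tail()` to get the last few rows, or `df.sample()` to get random rows. Like `df.head()`, `df.tail()` returns 5 rows by default, and we can change that by passing in an integer. `df.sample()` returns a single random row by default; pass in an integer, as in `df.sample(5)`, to get more. 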

In [None]:
df.head()

In [None]:
df.tail()
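
The three preview methods can be compared side by side on a toy dataframe. This is only a sketch: the `toy` frame and its prices are invented, and `random_state=0` is passed purely so that the random draw is repeatable.

```python
import pandas as pd

# Invented data: seven rows with a single column.
toy = pd.DataFrame({"price": [100, 200, 300, 400, 500, 600, 700]})

first_three = toy.head(3)    # first 3 rows
last_two = toy.tail(2)       # last 2 rows
one_random = toy.sample()    # a single random row by default
three_random = toy.sample(n=3, random_state=0)  # 3 random rows, repeatable

print(len(first_three), len(last_two), len(one_random), len(three_random))
```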

Once we've checked out a few rows in the dataframe, it's time to get some summary statistics. It's always a good idea to know what we're working with! There are two ways to get an overview: 

- `df.info()` prints the column names, the number of non-null cells in each column, and the data type of each column. This is useful for seeing how much data is missing from the dataset. 

- `df.describe()` returns the count, mean, standard deviation, min, 25th percentile, median, 75th percentile, and the max of all numerical columns. This is useful to have a quick idea of what the data might look like. However, be careful of relying on these too much, as they might not give you the full picture. Read more about [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) if you're interested in seeing an example!

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()
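
If you want the amount of missing data as numbers rather than reading it off the `df.info()` printout, one option is `isna().sum()`. The sketch below uses an invented `toy` frame; it also shows that `describe()` only summarizes the numeric columns by default.

```python
import numpy as np
import pandas as pd

# Invented data with a few missing cells.
toy = pd.DataFrame({
    "price": [1000, 2000, 3000, 4000],
    "year": [2001.0, np.nan, 2010.0, np.nan],
    "model": ["a", "b", None, "d"],
})

missing_per_column = toy.isna().sum()  # number of missing cells in each column
summary = toy.describe()               # "model" is left out: it is not numeric

print(missing_per_column)
print(summary)
```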

## Indexing

Now that we're able to view some of the dataset, let's take a look at how to view specific rows. There are two different ways to index into the dataframe: `iloc` and `loc`. We will take a closer look at an example of how they differ in a later section. 

`iloc` indexes based on the order of the rows. For example, if we want the 10th row, we would do `df.iloc[9]`. (Remember, we start indexing from 0!) 

`loc` indexes based on the index of the rows. For example, if we want the row that has the index 9, we would do `df.loc[9]`. 

In some cases, `df.loc[9]` and `df.iloc[9]` might return the same row. This occurs when the 10th row (position 9) happens to have the index 9. We'll see an example of when this is not true in a later section.

In [None]:
df.iloc[9]

In [None]:
df.loc[9]
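
The difference is easiest to see when the index labels are not simply 0, 1, 2, and so on. The following sketch uses a made-up dataframe whose index labels are 10, 20 and 30.

```python
import pandas as pd

# Invented data; the index labels are deliberately not 0, 1, 2.
toy = pd.DataFrame({"model": ["a", "b", "c"]}, index=[10, 20, 30])

by_position = toy.iloc[1]  # second row, whatever its label is
by_label = toy.loc[20]     # row whose index label is 20

# Label 1 does not exist here, so toy.loc[1] would raise a KeyError.
label_1_exists = 1 in toy.index

print(by_position["model"], by_label["model"], label_1_exists)
```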

### Multiple rows

Similarly to how we can use the `:` notation to get a slice from a python list, we can use the `:` notation to get a slice from a dataframe using `iloc`. 

Likewise, if we have a list of the indexes for rows that we want, we can get them using `loc`.

In [None]:
df.iloc[0:2]

In [None]:
df.iloc[[1, 12, 40]]

In [None]:
df.loc[[1, 12, 40]]
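
One detail worth knowing when slicing: `iloc` excludes the end of the slice, just like a python list, while a `loc` slice includes the end label. A small sketch on an invented frame with the default 0 to 4 index:

```python
import pandas as pd

# Invented data with the default integer index 0..4.
toy = pd.DataFrame({"model": list("abcde")})

pos_slice = toy.iloc[0:2]   # positions 0 and 1; the end is excluded
label_slice = toy.loc[0:2]  # labels 0, 1 and 2; the end is included
picked = toy.loc[[1, 3]]    # an explicit list of labels

print(len(pos_slice), len(label_slice), list(picked["model"]))
```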

## Selecting Columns

Sometimes we only want to see the values for a certain column. We can do that in two different ways: 

`df.column_name` and `df["column_name"]`. Whichever you use is up to you! However, be careful when your column name is not a valid name for a python variable (for example, "Column Name" is not a valid python variable since it contains a space). In that case, you have to use the square bracket notation. 

In [None]:
df["id"]

In [None]:
df.id
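
Besides names with spaces, dot notation also misbehaves when a column shares its name with an existing DataFrame attribute. The sketch below uses an invented frame with a `size` column to illustrate; square brackets work in every case.

```python
import pandas as pd

# Invented data: one ordinary name, one with a space, one that clashes
# with the built-in DataFrame.size attribute.
toy = pd.DataFrame({
    "id": [1, 2],
    "Column Name": ["x", "y"],
    "size": ["full", "compact"],
})

ids_attr = toy.id             # fine: "id" is a valid, unused attribute name
ids_bracket = toy["id"]       # same column
spaced = toy["Column Name"]   # brackets are the only option here
size_col = toy["size"]        # the column
cell_count = toy.size         # NOT the column: the number of cells (rows * columns)

print(cell_count)
```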

### Selecting Multiple Columns

Sometimes, we want to see more than one column. Easy! Just pass in the list of column names you want. Note that the order of the columns will be in the same order as the list you pass in. 

In the example below, although the region column occurs after the id column, the dataframe returned has the region column before the id column since it follows the order of the list.

In [None]:
df[["region", "id"]]
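
A related point is the type of what you get back: a single column name returns a Series, while a list of names returns a DataFrame, even if the list has only one entry. A minimal sketch on invented data:

```python
import pandas as pd

# Invented data.
toy = pd.DataFrame({
    "id": [1, 2],
    "region": ["auburn", "denver"],
    "price": [100, 200],
})

one_col = toy["id"]                # single name -> Series
one_col_frame = toy[["id"]]        # list with one name -> DataFrame
reordered = toy[["region", "id"]]  # columns follow the order of the list

print(type(one_col).__name__, type(one_col_frame).__name__, list(reordered.columns))
```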

## Boolean Indexing

Sometimes we don't know the exact indexes of the rows we want, but we do know a specific condition that they must satisfy. Don't worry, pandas has you covered! Through the power of boolean indexing, we can achieve this. First, let's take a look at what a boolean series looks like. In this example, `df["year"] < 2010` means that we're checking the truth value of the predicate `x < 2010` for every row in the `year` column. 

In [None]:
df["year"] < 2010

Notice how every row corresponds to either `True` or `False`. With boolean indexing, we only take the rows that are `True`. 

In [None]:
df[df["year"] < 2010].head()

Notice how the rows with index 0 and 1 are omitted. This is because those cars were manufactured in or after 2010! Rows with a missing year are left out as well, since comparing NaN with `<` gives `False`.

We can also combine multiple conditions at the same time. To do this, we need [bitwise operators](https://realpython.com/python-bitwise-operators/). The ones we use most often are `&`, `|`, and `~`, corresponding to `and`, `or`, and `not` respectively.

In plain English, the example below translates to "cars that were manufactured by BMW before 2010."

In [None]:
cond1 = df["year"] < 2010
cond2 = df["manufacturer"] == "bmw"

df[cond1 & cond2].head()
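
If you write the conditions inline instead of storing them in variables first, each comparison needs its own parentheses, because `&`, `|` and `~` bind more tightly than `<` and `==`. The sketch below shows all three operators on an invented four-row frame.

```python
import pandas as pd

# Invented data.
toy = pd.DataFrame({
    "year": [2005, 2012, 2008, 2015],
    "manufacturer": ["bmw", "bmw", "ford", "ford"],
})

# and: both conditions must hold
old_bmw = toy[(toy["year"] < 2010) & (toy["manufacturer"] == "bmw")]
# or: at least one condition must hold
old_or_bmw = toy[(toy["year"] < 2010) | (toy["manufacturer"] == "bmw")]
# not: flip every True/False
not_bmw = toy[~(toy["manufacturer"] == "bmw")]

print(list(old_bmw.index), list(old_or_bmw.index), list(not_bmw.index))
```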

## Difference between iloc and loc

Remember how I mentioned we were going to look at a closer example? This is a case where `iloc` and `loc` are going to differ. This occurs pretty often, so make sure you understand what's happening here.

In [None]:
df_old = df[df["year"] < 2010]
df_old.iloc[2]

In [None]:
df_old.loc[2]

Why do `df_old.iloc[2]` and `df_old.loc[2]` return different rows? In the first case, we are asking for the 3rd row of `df_old`, which happens to have index 5. In the second case, we are asking for the row with index 2, which is in fact the first row of `df_old`!

In [None]:
df_old.head()
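
The same effect can be reproduced on a tiny invented frame. Filtering keeps the original index labels, so position and label drift apart. If you would rather have labels that match positions again, `reset_index(drop=True)` is one way to renumber them.

```python
import pandas as pd

# Invented data with the default index 0..5.
toy = pd.DataFrame({"year": [2015, 2012, 2005, 2018, 2001, 2008]})

toy_old = toy[toy["year"] < 2010]    # keeps the original labels 2, 4 and 5

third_by_position = toy_old.iloc[2]  # third remaining row (label 5)
by_label_2 = toy_old.loc[2]          # row labelled 2, the first remaining row

renumbered = toy_old.reset_index(drop=True)  # labels become 0, 1, 2 again

print(list(toy_old.index), list(renumbered.index))
```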

## Dropping Columns

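Sometimes we want to drop columns because they do not contain any meaningful information, or because they contain too many missing (NaN) values. We can drop specific columns as shown below. Note that `drop` returns a new dataframe and leaves `df` unchanged, so assign the result if you want to keep it.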

In [None]:
df.drop(columns=["Unnamed: 0", "size"])
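
As a quick check of how `drop` behaves, the sketch below drops a column from an invented frame and confirms that the original still has it. Only the returned dataframe is missing the column.

```python
import pandas as pd

# Invented data.
toy = pd.DataFrame({
    "id": [1, 2],
    "size": [None, "compact"],
    "price": [100, 200],
})

trimmed = toy.drop(columns=["size"])  # new dataframe without "size"

print(list(toy.columns))      # the original is untouched
print(list(trimmed.columns))
```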

## Dropping rows

Suppose that a column has some missing values. However, most rows do not have a missing value in that column. How do we remove rows that have a missing value in that column? We can do so with boolean indexing. However, since it is a common occurrence, pandas also provides us with a method to do so easily. 

In [None]:
df.dropna(subset=["year"])

The example above drops rows that contain a missing value in the "year" column. If we wish to drop all rows with any missing values, we can simply call `df.dropna()` without any parameters.
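
For comparison, here is a sketch of the boolean indexing version next to `dropna`, on an invented frame. `notna()` builds the boolean series, and both approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Invented data with missing cells in both columns.
toy = pd.DataFrame({
    "year": [2001.0, np.nan, 2010.0],
    "model": ["a", "b", None],
})

via_mask = toy[toy["year"].notna()]       # boolean indexing
via_dropna = toy.dropna(subset=["year"])  # same rows, less typing
no_missing = toy.dropna()                 # rows with no missing value anywhere

print(list(via_dropna.index), list(no_missing.index))
```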

## Creating new columns

Often, we wish to create new columns from existing columns. We can do so quite easily! There are two ways to do so:

When the operation can be vectorized, we can simply use existing columns to create new information. In the example below, we want to create a new column called "full_model", which combines the manufacturer and car model. 

In [None]:
df["full_model"] = df["manufacturer"] + " " + df["model"]
df["full_model"]

Sometimes, however, we want to map a function over each row to create a new value. We can make use of a powerful method in pandas called `apply`.

Suppose that we want the manufacturer + model + year for car models that were manufactured before 2000, and manufacturer + model for car models that were manufactured in or after 2000. Although there are many different ways we can do so, we are going to make use of the apply function to make it straightforward.

Note the `axis=1` parameter. It tells pandas to call the function once per row, passing each row in as a Series. With `axis=0` (the default), the function is instead called once per column.

In [None]:
def f(x):
    result = f"{x['manufacturer']} {x['model']}"
    if x["year"] < 2000:
        return result + f" {(int(x['year']))}"
    else:
        return result

df.apply(f, axis=1)
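
To make the effect of `axis` concrete, the sketch below applies one function per row and another per column on an invented numeric frame. The column names `a` and `b` are placeholders.

```python
import pandas as pd

# Invented data.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=1: the function receives one row at a time -> one result per row
row_sums = toy.apply(lambda row: row["a"] + row["b"], axis=1)

# axis=0: the function receives one column at a time -> one result per column
col_max = toy.apply(lambda col: col.max(), axis=0)

print(list(row_sums))
print(col_max.to_dict())
```

The row-wise result can be stored as a new column in the usual way, for example `toy["total"] = row_sums`.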

## Statistics

Pandas also gives us quick and easy ways to get summary statistics from each column. 

In [None]:
df["price"].mean()

In [None]:
df["price"].std()

In [None]:
df["price"].sum()

In [None]:
df["price"].min()

In [None]:
df["price"].max()
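
Several of these statistics can also be requested in one call with `agg`. The sketch below uses an invented three-value series. It also shows one thing to be aware of: pandas computes the sample standard deviation (`ddof=1`) by default, so pass `ddof=0` if you want the population version.

```python
import pandas as pd

# Invented prices.
prices = pd.Series([100, 200, 300])

stats = prices.agg(["mean", "std", "min", "max", "sum"])  # one Series of results
population_std = prices.std(ddof=0)  # divide by n instead of n - 1

print(stats)
print(population_std)
```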

## Column uniqueness

One of the most common questions we want to ask is "How many unique values are there for each column?". Pandas gives us a painless way to answer that question quickly. There are 3 methods introduced here: 

`.unique()`, which gives us all the unique values.

`.nunique()`, which gives us the number of unique values. 

`.value_counts()`, which gives us the number of rows with each unique value.

In [None]:
df["manufacturer"].unique()

In [None]:
df["manufacturer"].nunique()

In [None]:
df["manufacturer"].value_counts()
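
These three methods treat missing values differently, which matters for a column like `manufacturer` that may contain gaps. A sketch on an invented series with one missing entry:

```python
import pandas as pd

# Invented data with one missing manufacturer.
makers = pd.Series(["ford", "bmw", "ford", None, "ford"])

distinct = makers.unique()       # includes the missing value as its own entry
n_distinct = makers.nunique()    # missing values are not counted by default
counts = makers.value_counts()   # missing values are left out by default
counts_with_missing = makers.value_counts(dropna=False)  # keep them in the tally

print(len(distinct), n_distinct)
print(counts_with_missing)
```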

Going one step further, we can easily visualize value counts with a bar chart.

In [None]:
df["manufacturer"].value_counts().plot.barh()
plt.title("Distribution of manufacturers")
plt.show()

## Case study

Suppose that we are a resident of Auburn, and we decide to start looking for a new car. We don't have much of a budget, so we decide to go for relatively cheap cars. More specifically, we define "cheap" cars as cars priced at or below 0.75 times the average price of cars in the Auburn region. However, we don't want cars in poor shape, so we only include cars in "good" or "excellent" condition. Among all these cars, which is the most expensive? 

First, we want to only select listings in the Auburn region.

```python 
df_case_study = df[df["region"] == "auburn"]
```

Next, we want the average price of the cars in the Auburn region.

```python 
auburn_mean = df_case_study["price"].mean()
```

We can use boolean indexing to act as a "filter". For now, we store this boolean array. 

```python
cond_cheap = df_case_study["price"] <= 0.75*auburn_mean
```

We also want to select cars with "good" or "excellent" condition. We can make use of `|` here.
```python
cond_good = df_case_study["condition"] == "good"
cond_excellent = df_case_study["condition"] == "excellent"
cond_condition = cond_good | cond_excellent
```

Finally, we put everything together, and get the most expensive amongst them!

```python 
df_case_study = df_case_study[cond_cheap & cond_condition]
df_case_study.iloc[df_case_study["price"].argmax()]
```

In [None]:
df_case_study = df[df["region"] == "auburn"]
# df_case_study = df_case_study.drop(columns=["Unnamed: 0"])
# df_case_study = df_case_study.dropna(subset=["manufacturer"])
auburn_mean = df_case_study["price"].mean()
cond_cheap = df_case_study["price"] <= 0.75*auburn_mean
cond_good = df_case_study["condition"] == "good"
cond_excellent = df_case_study["condition"] == "excellent"
cond_condition = cond_good | cond_excellent

df_case_study = df_case_study[cond_cheap & cond_condition]
df_case_study.iloc[df_case_study["price"].argmax()]
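
The same steps can be tried end to end on a small invented table, which makes it easy to check the answer by hand. This sketch swaps in two alternatives that are not used above: `isin` in place of the two `==` comparisons joined by `|`, and `idxmax` with `loc` in place of `argmax` with `iloc`. The prices and conditions are made up.

```python
import pandas as pd

# Invented listings: four in Auburn, one elsewhere.
toy = pd.DataFrame({
    "region": ["auburn", "auburn", "auburn", "auburn", "other"],
    "price": [1000, 2000, 3000, 10000, 500],
    "condition": ["good", "fair", "excellent", "good", "good"],
})

auburn = toy[toy["region"] == "auburn"]
auburn_mean = auburn["price"].mean()                      # 4000.0

cheap = auburn["price"] <= 0.75 * auburn_mean             # at or below 3000
decent = auburn["condition"].isin(["good", "excellent"])  # either condition

shortlist = auburn[cheap & decent]
priciest = shortlist.loc[shortlist["price"].idxmax()]     # label of the max price

print(priciest)
```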

### Slightly harder: 

This time, a car is cheap if its price is at or below the average price for its manufacturer.

In [None]:
df_case_study = df_case_study.dropna(subset=["manufacturer"])
manu_to_mean = df_case_study.groupby("manufacturer")["price"].mean().to_dict()
is_cheap = df_case_study.apply(lambda x: manu_to_mean[x["manufacturer"]] >= x["price"], axis=1)
df_case_study[is_cheap]
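
As an alternative to the dictionary plus `apply`, `groupby(...).transform("mean")` returns the group average aligned with every row, so the comparison can be vectorized. This is a sketch on invented data, not a replacement for the cell above.

```python
import pandas as pd

# Invented listings.
toy = pd.DataFrame({
    "manufacturer": ["ford", "ford", "bmw", "bmw"],
    "price": [1000, 3000, 8000, 4000],
})

# One value per row: the mean price of that row's manufacturer.
maker_mean = toy.groupby("manufacturer")["price"].transform("mean")

is_cheap = toy["price"] <= maker_mean
cheap_cars = toy[is_cheap]

print(cheap_cars)
```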