# Pandas

* [datacamp-cheatsheet](http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3)
* [python-cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

In [1]:
import pandas as pd

## DataFrame

Use the `pd.DataFrame()` constructor to create a DataFrame. The dictionary keys become the column names, and the optional `index` parameter sets the row labels:

In [2]:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

|           | Bob           | Sue          |
|-----------|---------------|--------------|
| Product A | I liked it.   | Pretty good. |
| Product B | It was awful. | Bland.       |


## Series


A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list.

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an `index` parameter. However, a Series does not have a column name; it only has one overall `name`:

In [4]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [5]:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

2015 Sales    30
2016 Sales    35
2017 Sales    40
Name: Product A, dtype: int64
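
To make the "single column" relationship concrete, here is a minimal sketch. The `sales` table and its numbers are made up for illustration; the point is only that selecting one column of a DataFrame gives back a Series named after that column.

```python
import pandas as pd

# Hypothetical sales table, made up for illustration.
sales = pd.DataFrame(
    {"Product A": [30, 35, 40], "Product B": [21, 34, 31]},
    index=["2015 Sales", "2016 Sales", "2017 Sales"],
)

# Selecting a single column returns a Series named after that column.
product_a = sales["Product A"]
print(type(product_a).__name__)
print(product_a.name)
```

The row labels of the DataFrame carry over as the index of the Series.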

## Accessors

Columns of a DataFrame can be accessed as attributes, just as we access a property of any Python object:

`reviews.country`

If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:
`reviews['country']`

To drill down to a single specific value, we need only use the indexing operator `[]` once more:
`reviews['country'][0]`
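
A short sketch comparing the two access styles. The `reviews` frame here is a tiny made-up stand-in for the real dataset, and the `taster name` column (with a space) is a hypothetical name chosen to show a case where only bracket access works.

```python
import pandas as pd

# Tiny stand-in for the `reviews` data; the values are made up.
reviews = pd.DataFrame(
    {"country": ["Italy", "Portugal"], "taster name": ["Ann", "Ben"]}
)

# Attribute access and bracket access return the same column.
same_column = reviews.country.equals(reviews["country"])

# A column name containing a space is only reachable with brackets.
tasters = reviews["taster name"]

# Index once more to get a single value (row label 0 here).
first_country = reviews["country"][0]
print(same_column, first_country)
```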


## iloc

**Index-based selection** means selecting data based on its numerical position in the data. When we use **iloc**, we treat the dataset like a big matrix (a list of lists) that we index into by position.

* To select just the second and third entries of the first column:

`reviews.iloc[1:3, 0]`

* To select the last five elements of the dataset

`reviews.iloc[-5:]`
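
The same selections on a small made-up frame, as a sketch. The data is hypothetical and only six rows long, so the last two rows are taken instead of the last five.

```python
import pandas as pd

# Made-up stand-in for `reviews`, just large enough to slice.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "US", "France", "Spain"],
        "points": [87, 88, 86, 90, 91, 85],
    }
)

# Rows at positions 1 and 2 of the first column (position 0).
second_and_third = reviews.iloc[1:3, 0]

# Negative positions count from the end, as with Python lists.
last_two = reviews.iloc[-2:]
print(second_and_third.tolist())
print(last_two)
```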

## loc 
The second paradigm for attribute selection is the one followed by the loc operator: **label-based selection**. In this paradigm, it's the data index value, not its position, which matters.

`reviews.loc[:, ['taster_name', 'taster_twitter_handle', 'points']]`

```
indices = [1, 2, 3, 5, 8]
sample_reviews = reviews.loc[indices]
```

`df = reviews.loc[:99, ['country','variety']]`

## iloc vs loc
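`iloc` follows Python's usual slicing convention, where the end of the range is excluded, while `loc` includes it. So, with the default integer index, `df.iloc[0:1000]` returns 1000 entries, while `df.loc[0:1000]` returns 1001 of them! To get 1000 elements using `loc`, you need to go one lower and ask for `df.loc[0:999]`.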
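
A quick way to check the off-by-one difference on a small frame. The `df` below is a made-up ten-row example with the default integer index.

```python
import pandas as pd

# Ten rows with the default integer index 0..9 (made-up data).
df = pd.DataFrame({"value": range(10)})

by_position = df.iloc[0:5]  # positions 0-4, end excluded
by_label = df.loc[0:5]      # labels 0-5, end included

print(len(by_position), len(by_label))
```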

## index

`reviews.set_index("title")`
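
A minimal sketch of what `set_index()` does, using made-up data. It returns a new DataFrame whose row labels come from the chosen column, which then makes label-based lookups with `loc` possible; the original frame is left as it was.

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame({"title": ["Wine X", "Wine Y"], "points": [88, 92]})

# set_index returns a new DataFrame; `reviews` itself is unchanged.
by_title = reviews.set_index("title")

# Rows can now be looked up by title.
points_y = by_title.loc["Wine Y", "points"]
print(points_y)
```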

## Conditional Selection

* `reviews.loc[reviews.country == 'Italy']`
* `reviews.loc[(reviews.country == 'Italy') & (reviews.points >= 90)]`
* `reviews.loc[(reviews.country == 'Italy') | (reviews.points >= 90)]`

**built-in conditional selectors**
* `reviews.loc[reviews.country.isin(['Italy', 'France'])]`
* `reviews.loc[reviews.price.notnull()]`


* `top_oceania_wines = reviews.loc[reviews.country.isin(['Australia', 'New Zealand']) & (reviews.points >=95)]`
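
The selectors above all work by building a boolean Series and passing it to `loc`. Here is a sketch on a small made-up frame. Note that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>=` in Python.

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "France", "Italy", "Australia"],
        "points": [92, 88, 85, 96],
        "price": [20.0, None, 15.0, 50.0],
    }
)

# A comparison produces a boolean Series, one value per row.
italy_mask = reviews.country == "Italy"

# Combine masks with & (and) or | (or); keep the parentheses.
good_italian = reviews.loc[italy_mask & (reviews.points >= 90)]
italian_or_good = reviews.loc[italy_mask | (reviews.points >= 90)]

# Built-in selectors also return boolean masks.
priced = reviews.loc[reviews.price.notnull()]
print(len(good_italian), len(italian_or_good), len(priced))
```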

## Assigning Data
 * `reviews['critic'] = 'everyone'`

## Summary Functions

* `reviews.points.describe()`
* `reviews.points.mean()`
* `reviews.taster_name.unique()`
* `reviews.taster_name.value_counts()`

## Maps
A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values.

Maps allow us to transform data in a DataFrame or Series one value at a time for an entire column.
The function you pass to `map()` should expect a single value from the Series (a point value, in the example below) and return a transformed version of that value. `map()` returns a new Series where all the values have been transformed by your function.

`review_points_mean = reviews.points.mean()`

`reviews.points.map(lambda p: p - review_points_mean)`


***
```
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_frui = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_frui], index=['tropical', 'fruity'])
```
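
A runnable sketch of both `map()` uses on made-up data. It also shows the plain arithmetic form `reviews.points - review_points_mean`, which gives the same result here; that comparison is an addition to these notes, not part of the original example.

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame(
    {
        "points": [80, 90, 100],
        "description": ["tropical fruit", "fruity and bright", "dry"],
    }
)

review_points_mean = reviews.points.mean()

# map() calls the function once per value and returns a new Series.
centered = reviews.points.map(lambda p: p - review_points_mean)

# Plain arithmetic on the Series gives the same values.
vectorized = reviews.points - review_points_mean

# Mapping to booleans and summing counts the matching rows.
n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
print(centered.tolist(), n_trop)
```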

## Apply

```
def remean_points(row):
    row.points = row.points - review_points_mean
    return row
```

```reviews.apply(remean_points, axis='columns')```
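
With `axis='columns'`, `apply()` calls the function once per row and assembles the returned rows into a new DataFrame. A minimal sketch on made-up data follows; the `row.copy()` line is a defensive addition for this sketch, not something the original function does.

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame({"country": ["Italy", "US"], "points": [85, 95]})
review_points_mean = reviews.points.mean()

def remean_points(row):
    # Work on a copy so the input row is never touched (sketch-only precaution).
    row = row.copy()
    row["points"] = row["points"] - review_points_mean
    return row

# Each row is passed to remean_points as a Series.
remeaned = reviews.apply(remean_points, axis="columns")
print(remeaned)
```

The result is a new DataFrame; `reviews` keeps its original point values.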

### Which wine is the "best bargain"?

```
bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
```
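
`idxmax()` is a method, so it has to be called; it returns the index label of the largest value, which can then be passed to `loc`. A sketch with made-up wines:

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame(
    {
        "title": ["Wine X", "Wine Y", "Wine Z"],
        "points": [90, 86, 95],
        "price": [30.0, 10.0, 60.0],
    }
)

# Points per unit of price; higher means a better bargain.
ratio = reviews.points / reviews.price

# idxmax() returns the row label of the highest ratio.
bargain_idx = ratio.idxmax()
bargain_wine = reviews.loc[bargain_idx, "title"]
print(bargain_idx, bargain_wine)
```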

### Translate points into star ratings, taking into account an advertiser from Canada

```
def rate(row):
    # Wines from Canada always get 3 stars, regardless of points.
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(rate, axis='columns')
```

## Groupby

```df.groupby(['col1','col2']).size().reset_index(name='count')```


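```df.groupby('column_name').column_name.count()``` gives the same counts as ```df.column_name.value_counts()``` and ```df.groupby('column_name').column_name.size()```; only the ordering differs, since ```value_counts()``` sorts by frequency while ```groupby()``` sorts by group key. <br>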
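
A small check of that equivalence on made-up data. The column name `country` is just an example.

```python
import pandas as pd

# Made-up data with repeated values.
df = pd.DataFrame(
    {"country": ["Italy", "France", "Italy", "US", "Italy", "France"]}
)

by_count = df.groupby("country").country.count()
by_size = df.groupby("country").country.size()
by_value_counts = df.country.value_counts()

# Same counts in all three; only the row order differs.
print(by_count.to_dict())
print(by_value_counts.index[0])
```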

### Other functions
We can use any of the summary functions we've used before with this data.<br>
```df.groupby('column_name').column_name.min()``` 


### Apply()
You can think of each group we generate as being a slice of our DataFrame containing only data with values that match. This DataFrame is accessible to us directly using the **apply()** method, and we can then manipulate the data in any way we see fit. For example, here's one way of selecting the name of the first wine reviewed from each winery in the dataset:

```reviews.groupby('winery').apply(lambda df: df.title.iloc[0])```

### Group by more than one column
For even more fine-grained control, you can also group by **more than one column**. For example, here's how we would pick out the best wine by country and province:

```reviews.groupby(['country', 'province']).apply(lambda df: df.loc[df.points.idxmax()])```
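
As a sketch of the same idea on made-up data, the variant below avoids `apply()` and instead asks each group for the index label of its top-scoring row with `idxmax()`, then looks those rows up with `loc`. This is a swapped-in technique, not the expression above, but it selects the same rows.

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "Italy", "France"],
        "province": ["Tuscany", "Tuscany", "Sicily", "Bordeaux"],
        "title": ["A", "B", "C", "D"],
        "points": [88, 93, 90, 91],
    }
)

# Row label of the highest-scoring wine in each (country, province) group.
best_idx = reviews.groupby(["country", "province"]).points.idxmax()

# Look up the full rows for those labels.
best_wines = reviews.loc[best_idx]
print(best_wines)
```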


## Agg
Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:

```reviews.groupby(['country']).price.agg([len, min, max])```
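
A runnable sketch of `agg()` on made-up prices. It passes the aggregations by name (`'count'`, `'min'`, `'max'`) rather than as the built-in functions; that is a substitution made for this sketch, and note that `'count'` skips missing values whereas `len` does not.

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "France"],
        "price": [15.0, 40.0, 22.0],
    }
)

# One row per country, one column per aggregation.
price_summary = reviews.groupby("country").price.agg(["count", "min", "max"])
print(price_summary)
```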

## Multi-indexes
A multi-index differs from a regular index in that it has multiple levels. 

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value.

```countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])```

However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:

```df.reset_index()```
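
A sketch tying these pieces together on made-up data: grouping by two columns produces a multi-index, a value is retrieved with a tuple of two labels, and `reset_index()` turns the index levels back into ordinary columns.

```python
import pandas as pd

# Made-up stand-in for `reviews`.
reviews = pd.DataFrame(
    {
        "country": ["Italy", "Italy", "Italy", "France"],
        "province": ["Tuscany", "Tuscany", "Sicily", "Bordeaux"],
        "description": ["ripe", "oaky", "crisp", "earthy"],
    }
)

countries_reviewed = reviews.groupby(["country", "province"]).description.agg([len])
is_multi = isinstance(countries_reviewed.index, pd.MultiIndex)

# Both index levels are needed to address a single row.
tuscany_count = countries_reviewed.loc[("Italy", "Tuscany"), "len"]

# Back to a regular index, with country and province as columns.
flat = countries_reviewed.reset_index()
print(is_multi, tuscany_count)
print(flat)
```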

## Sort

```df.sort_values(by='column_name', ascending=False)```

To sort by index values, use the companion method sort_index():
```df.sort_index()```

Sort by more than one column:

```df.sort_values(by=['country', 'len'])```
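
The three sorting calls on a small made-up frame, as a sketch. Each one returns a new, sorted DataFrame and keeps the original row labels, which is why `sort_index()` can restore the original order.

```python
import pandas as pd

# Made-up data; the column names mirror the example above.
df = pd.DataFrame({"country": ["US", "Italy", "Italy"], "len": [5, 9, 2]})

# Largest values first.
by_len_desc = df.sort_values(by="len", ascending=False)

# Sort by country, breaking ties by len.
by_country_len = df.sort_values(by=["country", "len"])

# Sorting by index labels restores the original order.
restored = by_len_desc.sort_index()
print(by_country_len)
```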

## DTypes

```df.column_name.astype('float64')```

## Missing Data

### null
```df[pd.isnull(df.column_name)]```

### notnull
```df[pd.notnull(df.column_name)]```

**Examples**

```n_missing_prices = len(reviews[reviews.price.isnull()])```

Alternative solution: if we sum a boolean Series, True is treated as 1 and False as 0, so the sum is the number of missing values. <br>
```n_missing_prices = reviews.price.isnull().sum()```

or equivalently:<br>
```n_missing_prices = pd.isnull(reviews.price).sum()```

### filling values
```df.column_name.fillna("Unknown")```

### replace
```df.column_name.replace("@kerinokeefe", "@kerino")```

**Summary**

```reviews.region_1.fillna("Unknown").value_counts().sort_values(ascending=False)```
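
A sketch of the missing-data helpers on made-up data. Since `value_counts()` already returns counts in descending order by default, the trailing `sort_values(ascending=False)` in the summary line is left out here.

```python
import pandas as pd

# Made-up stand-in for `reviews`, with some missing values.
reviews = pd.DataFrame(
    {
        "price": [10.0, None, 25.0, None],
        "region_1": ["Etna", None, "Etna", "Napa"],
    }
)

# Summing the boolean mask counts the missing prices.
n_missing_prices = reviews.price.isnull().sum()

# fillna returns a new Series; the original column keeps its gaps.
region_counts = reviews.region_1.fillna("Unknown").value_counts()
print(n_missing_prices)
print(region_counts)
```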

## Rename
```df.rename(columns={'column_name': 'new_column_name'})```

```renamed = df.rename(columns=dict(column_1='new_column_1', column_2='new_column_2'))```

## Concat
```pd.concat([df1, df2])```
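
A minimal sketch of `pd.concat()` stacking two made-up frames with the same columns. By default the original row labels are kept, so labels can repeat; passing `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two made-up frames with the same columns.
df1 = pd.DataFrame({"title": ["A", "B"], "points": [88, 90]})
df2 = pd.DataFrame({"title": ["C"], "points": [95]})

# Rows of df2 are appended below df1; row labels are kept as-is.
stacked = pd.concat([df1, df2])

# ignore_index=True gives a fresh 0..n-1 index.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(stacked.index.tolist(), renumbered.index.tolist())
```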