# Pandas Data Processing
## Data transformation and cleaning

### Concatenation

- The addition of one dataset to another
- Typically used to **extend** a dataset with extra rows or columns
- To achieve this we can use the pandas `concat()` function (called as `pd.concat()`)

#### Concatenating DataFrames

In [36]:
import numpy as np # we'll use this later
import pandas as pd

q1 = pd.read_csv('data/new-drugs-q1.csv')
q2 = pd.read_csv('data/new-drugs-q2.csv')
q3 = pd.read_csv('data/new-drugs-q3.csv')

Each file contains information about new prescription drugs made available in California during a given quarter of 2019 (based on data from [OSHPD](https://oshpd.ca.gov/visualizations/drugs-introduced-to-market/)).

In [37]:
for q in [q1, q2, q3]:
    print(q.shape, q.columns)

(49, 4) Index(['NDC Number', 'Date Introduced to Market', 'Manufacturer Name',
       'Drug Product Description'],
      dtype='object')
(111, 4) Index(['NDC Number', 'Date Introduced to Market', 'Manufacturer Name',
       'Drug Product Description'],
      dtype='object')
(46, 4) Index(['NDC Number', 'Date Introduced to Market', 'Manufacturer Name',
       'Drug Product Description'],
      dtype='object')


- We have read the data and created a DataFrame from each CSV file
- We have confirmed that they contain the same columns in the same order

In [38]:
df = pd.concat([q1, q2, q3], axis=0)

In [39]:
print(df.shape)
df.head(3)

(206, 4)


Unnamed: 0,NDC Number,Date Introduced to Market,Manufacturer Name,Drug Product Description
0,72626260101,2019-01-02,Asegua Therapeutics LLC,agLDV/SOF (ledipasvir 90 mg/sofosbuvir 400 mg)...
1,72626270101,2019-01-02,Asegua Therapeutics LLC,agSOF/VEL (sofosbuvir 400 mg/velpatasvir 100 m...
2,93765256,2019-01-03,Teva Pharmaceuticals USA,VARDENAFIL HCL TABLETS 2.5MG 30


- We used the pandas `concat()` function, passing a list of DataFrames as the first argument
- The `axis` parameter determines whether to concatenate along **rows** or **columns**
    - The default `0` is used here to combine rows from DataFrames with shared column names
    - `axis=1` would be used to extend a dataset with additional columns
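As a minimal sketch of the `axis=1` case, assuming two small made-up DataFrames (`names` and `ages` are hypothetical and not part of the drugs data), the columns are placed side by side and rows are aligned on their index labels:

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate axis=1
names = pd.DataFrame({'Name': ['Jesse', 'Kotryna']})
ages = pd.DataFrame({'Age': [21, 22]})

# axis=1 extends the dataset with extra columns, matching rows by index label
wide = pd.concat([names, ages], axis=1)
print(wide.shape)             # (2, 2)
print(wide.columns.tolist())  # ['Name', 'Age']
```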

In [40]:
df.index

Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9,
            ...
            36, 37, 38, 39, 40, 41, 42, 43, 44, 45],
           dtype='int64', length=206)

- Notice that the **index labels are unchanged** from what they were in each individual DataFrame, so the combined index contains duplicated values
- We can create new unique row index labels by passing `ignore_index=True` to `pd.concat()`
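The effect of `ignore_index=True` can be sketched with two tiny DataFrames (`part_1` and `part_2` are made-up stand-ins for the quarterly files):

```python
import pandas as pd

# Hypothetical stand-ins for two of the quarterly DataFrames
part_1 = pd.DataFrame({'value': [10, 20]})
part_2 = pd.DataFrame({'value': [30, 40]})

kept = pd.concat([part_1, part_2])                           # original labels kept
renumbered = pd.concat([part_1, part_2], ignore_index=True)  # fresh labels 0..n-1

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```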

![concat examples](img/concat.png)

In [41]:
df['NDC Number'].value_counts().max()

1

In [42]:
df_new = df.set_index('NDC Number')
df_new.head(2)

Unnamed: 0_level_0,Date Introduced to Market,Manufacturer Name,Drug Product Description
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
72626260101,2019-01-02,Asegua Therapeutics LLC,agLDV/SOF (ledipasvir 90 mg/sofosbuvir 400 mg)...
72626270101,2019-01-02,Asegua Therapeutics LLC,agSOF/VEL (sofosbuvir 400 mg/velpatasvir 100 m...


- We can see from using the `.value_counts()` Series method that `NDC Number` contains no duplicate values
- We used the `.set_index()` method to use `NDC Number` as our index in a new DataFrame assigned to `df_new`

### Joining datasets

- We often need to **join** (combine) datasets which have some **relationship** with one another
- The relationship (or **association**) requires a common **key** in each dataset so that they can be combined

**one-to-one joins**   
The key values are unique in both datasets, so each row in one dataset matches at most one row in the other
 
**many-to-one joins**   
The first dataset has multiple rows for one or more of the key values, while the second dataset has only one row for each value

**many-to-many joins**  
Both datasets have numerous instances of one or more of the values in the key
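
To make the difference between these relationships concrete, here is a small sketch using made-up tables (`orders`, `customers` and `visits` are hypothetical). It uses the `.merge()` method, and the thing to watch is the number of rows in each result:

```python
import pandas as pd

# Hypothetical tables, for illustration only
orders = pd.DataFrame({'customer': ['ann', 'ann', 'bob'],
                       'item': ['pen', 'ink', 'pad']})
customers = pd.DataFrame({'customer': ['ann', 'bob'],
                          'city': ['Leeds', 'York']})
visits = pd.DataFrame({'customer': ['ann', 'ann'],
                       'day': ['Mon', 'Tue']})

# many-to-one: every order row picks up the single matching customer row
many_to_one = orders.merge(customers, on='customer')
print(len(many_to_one))   # 3

# many-to-many: 'ann' occurs twice on each side, giving 2 x 2 = 4 combinations
many_to_many = orders.merge(visits, on='customer')
print(len(many_to_many))  # 4
```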


#### Using the pandas `.merge()` method

pandas uses terminology borrowed from **SQL** (a popular language for **querying databases**) in the syntax for its methods which provide functionality for joining datasets. 

![sql joins](img/joins.png)

#### Examples of different joins with small datasets

There are several small DataFrames created in the `dataframes.py` file; let's take a look at two of them and then see how they can be merged:

In [43]:
from dataframes import students, residents

display(students, residents)

Unnamed: 0,Name,Subject
0,Jesse,Physics
1,Kotryna,Biochemistry
2,Xiaoyi,Chemistry
3,David,Medicine


Unnamed: 0,Name,Age
0,Jesse,21
1,Kotryna,22
2,Xiaoyi,23
3,Raoul,29


We would like to add the `Age` column data to the `students` table:

In [44]:
students.merge(residents, how="inner", left_on="Name", right_on="Name")

Unnamed: 0,Name,Subject,Age
0,Jesse,Physics,21
1,Kotryna,Biochemistry,22
2,Xiaoyi,Chemistry,23


- `how="inner"` means that only rows with a match in both tables are retained; no details for `Raoul` or `David` are included
- the `left_on` and `right_on` arguments are used to identify the column on which to join the tables
    - although here the column name is the same, they could be different in other scenarios
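
A sketch of the scenario where the key columns have different labels, assuming a hypothetical `Resident` column in the right table (the DataFrames below are made up and are not the ones in `dataframes.py`):

```python
import pandas as pd

# Hypothetical tables whose key columns have different labels
students_demo = pd.DataFrame({'Name': ['Jesse', 'Kotryna'],
                              'Subject': ['Physics', 'Biochemistry']})
residents_demo = pd.DataFrame({'Resident': ['Jesse', 'Kotryna'],
                               'Age': [21, 22]})

merged = students_demo.merge(residents_demo, how='inner',
                             left_on='Name', right_on='Resident')

# Both key columns are kept, because their labels differ
print(merged.columns.tolist())  # ['Name', 'Subject', 'Resident', 'Age']
```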

If we want to retain all of the names from the right table (`residents`), we can use `how="right"`:

In [45]:
students.merge(residents, how="right")

Unnamed: 0,Name,Subject,Age
0,Jesse,Physics,21
1,Kotryna,Biochemistry,22
2,Xiaoyi,Chemistry,23
3,Raoul,,29


- There is no data for `Subject` for `Raoul`, so this is returned as a `NaN` (missing value)

A similar result could be achieved by reversing the table order and using `left` instead of `right`:

In [46]:
residents.merge(students, how="left")

Unnamed: 0,Name,Age,Subject
0,Jesse,21,Physics
1,Kotryna,22,Biochemistry
2,Xiaoyi,23,Chemistry
3,Raoul,29,


- The only difference here is the column order
    - `Age` precedes `Subject` because `Age` was part of the left table, i.e. the DataFrame on which `.merge()` was called

**Outer joins** retain all rows from both tables, including rows whose key is found in only one of them.

The `indicator` parameter is useful if we want to see which of the original tables each row came from:

In [47]:
residents.merge(students, how="outer", indicator=True)

Unnamed: 0,Name,Age,Subject,_merge
0,Jesse,21.0,Physics,both
1,Kotryna,22.0,Biochemistry,both
2,Xiaoyi,23.0,Chemistry,both
3,Raoul,29.0,,left_only
4,David,,Medicine,right_only


#### Joining DataFrames using each index as the key

In [49]:
data_a = pd.read_csv('data/drugs-data-a.csv', index_col='NDC Number')
display(data_a.head(2))
data_a.shape

Unnamed: 0_level_0,Date Introduced to Market,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Date,Acquisition Price,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
47335093640,2019-03-01,705.67,1.0,,,,,,
47335023683,2019-04-25,7500.0,,0.0,,,,,


(79, 9)

- Notice how we used the `index_col` parameter of the pandas `read_csv()` function to use the values in the `NDC Number` column as our index
- The `.shape` DataFrame attribute tells us that there are fewer rows in `data_a` than in our previous DataFrame `df_new`

In [50]:
df_extra_on_index = df_new.merge(data_a, how='left', left_index=True, right_index=True)
df_extra_on_index.head(2)

Unnamed: 0_level_0,Date Introduced to Market_x,Manufacturer Name,Drug Product Description,Date Introduced to Market_y,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Date,Acquisition Price,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
72626260101,2019-01-02,Asegua Therapeutics LLC,agLDV/SOF (ledipasvir 90 mg/sofosbuvir 400 mg)...,,,,,,,,,
72626270101,2019-01-02,Asegua Therapeutics LLC,agSOF/VEL (sofosbuvir 400 mg/velpatasvir 100 m...,,,,,,,,,


- Here we used a **LEFT JOIN** (`how='left'`), since we want to retain all data in the original DataFrame `df_new` and supplement it with associated data from the DataFrame `data_a`
- We used `left_index=True` and `right_index=True` to specify the `index` of each DataFrame as the **key** on which they will be joined
- We can see that the default values for the `suffixes` parameter (`'_x'` and `'_y'`) have been used, since there was a column in both of the DataFrames labelled `Date Introduced to Market`
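
If the default suffixes are not descriptive enough, we can supply our own. A minimal sketch, assuming two made-up single-row DataFrames (`left` and `right`) which share a `date` column and an index named `id`:

```python
import pandas as pd

# Hypothetical DataFrames sharing an index and a 'date' column
left = pd.DataFrame({'date': ['2019-01-02']},
                    index=pd.Index([101], name='id'))
right = pd.DataFrame({'date': ['2019-01-02'], 'price': [9.5]},
                     index=pd.Index([101], name='id'))

merged = left.merge(right, how='left', left_index=True, right_index=True,
                    suffixes=('_left', '_right'))

# Only the clashing column label receives a suffix
print(merged.columns.tolist())  # ['date_left', 'date_right', 'price']
```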

#### Joining DataFrames using multiple columns as the key

In [51]:
df_extra = df_new.merge(data_a, how='left', on=['NDC Number', 'Date Introduced to Market'])
display(df_extra.head(2))
df_extra.shape

Unnamed: 0_level_0,Date Introduced to Market,Manufacturer Name,Drug Product Description,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Date,Acquisition Price,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
72626260101,2019-01-02,Asegua Therapeutics LLC,agLDV/SOF (ledipasvir 90 mg/sofosbuvir 400 mg)...,,,,,,,,
72626270101,2019-01-02,Asegua Therapeutics LLC,agSOF/VEL (sofosbuvir 400 mg/velpatasvir 100 m...,,,,,,,,


(206, 11)

- Here we have used the `on` parameter to provide a **list of column labels** which are present in both DataFrames
- Notice that this list can include the **row index name** (in this case, `NDC Number`)
- By including `Date Introduced to Market` in the key, we only see one instance of the column in the resulting DataFrame

#### Inner join

In [52]:
df_extra_inner = df_new.merge(data_a, on=['NDC Number', 'Date Introduced to Market'])
display(df_extra_inner.head(2))
df_extra_inner.shape

Unnamed: 0_level_0,Date Introduced to Market,Manufacturer Name,Drug Product Description,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Date,Acquisition Price,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
93765256,2019-01-03,Teva Pharmaceuticals USA,VARDENAFIL HCL TABLETS 2.5MG 30,704.59,1.0,101361.0,,,,,
93765356,2019-01-03,Teva Pharmaceuticals USA,VARDENAFIL HCL TABLETS 5MG 30,704.59,1.0,101361.0,,,,,


(79, 11)

- Using `.merge()` with the default `how` parameter results in an **INNER JOIN**
    - The resulting DataFrame has fewer rows; there were 79 rows with **matching keys**

#### Right join

In [53]:
df_extra_right = df_new.merge(data_a, how='right', on=['NDC Number', 'Date Introduced to Market'])
display(df_extra_right.head(2))
df_extra_right.shape

Unnamed: 0_level_0,Date Introduced to Market,Manufacturer Name,Drug Product Description,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Date,Acquisition Price,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
93765256,2019-01-03,Teva Pharmaceuticals USA,VARDENAFIL HCL TABLETS 2.5MG 30,704.59,1.0,101361.0,,,,,
93765356,2019-01-03,Teva Pharmaceuticals USA,VARDENAFIL HCL TABLETS 5MG 30,704.59,1.0,101361.0,,,,,


(79, 11)

- Notice how in this instance the resulting DataFrame from a **RIGHT JOIN** is the same as that returned when using an **INNER JOIN**
    - This is to be expected if all of the keys in the right DataFrame are present in the left DataFrame

```Python
pd.merge(left, right, how='left', on=['key1', 'key2'])
```

![left merge](img/left-merge-annotated.png)

#### pandas .merge() in detail

Let's work through the arguments (args) and keyword arguments (kwargs) for the DataFrame `.merge()` method found in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html):

```Python
DataFrame.merge(self, right, how='inner', 
                on=None, left_on=None, right_on=None, 
                left_index=False, right_index=False, 
                sort=False, suffixes=('_x', '_y'), 
                copy=True, indicator=False, validate=None)
```

```Python
DataFrame.merge(self, right,
```
        
- `self` refers to the **DataFrame** on which the method is being called, and is passed automatically to it; this DataFrame can be considered the **left** circle in each of the previous Venn diagrams
- `right` refers to the other **DataFrame** which we want to join with the original (left) DataFrame, and is represented by the right circle in the diagrams

```Python
DataFrame.merge(self, right, how='inner',
```

- `how` is the first **optional parameter** or **keyword argument**, all of which have default values
- The value given for `how` will determine how pandas attempts to **join** the two DataFrames
- In this case `inner` is the default value, which means that pandas will attempt an **inner join**

```Python
DataFrame.merge(self, right, how='inner', 
                on=None, left_on=None, right_on=None, 
                left_index=False, right_index=False, 
```
    
- The next parameters tell pandas which column(s) in each DataFrame contain(s) the **key** with which we want to join them
    - `on`, `left_on` and `right_on` can all take either a single label or a list of labels; where more than one label is used, **the values in all given columns in both DataFrames must match** for rows to be associated
    - `left_index` and `right_index` take Boolean values

- `on` gives us a way to associate the DataFrames using a single argument; useful if the key is in columns or indexes with the **same label in both DataFrames**
- if we don't use `on`, then we need to provide the key for each DataFrame separately:
    - For the **original** (left) DataFrame provide either `left_on=` with label(s) or `left_index=True`
    - For the **additional** (right) DataFrame provide either `right_on=` with label(s) or `right_index=True`
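
The two sides do not have to use the same kind of key. As a sketch, assuming a made-up `sales` table with the key in a column and a made-up `products` table with the key in its index:

```python
import pandas as pd

# Hypothetical data: the key is a column on the left and the index on the right
sales = pd.DataFrame({'product_id': [1, 2, 1], 'units': [5, 3, 7]})
products = pd.DataFrame({'label': ['pen', 'pad']}, index=[1, 2])

combined = sales.merge(products, how='left',
                       left_on='product_id', right_index=True)

print(combined['label'].tolist())  # ['pen', 'pad', 'pen']
```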

```Python
DataFrame.merge(self, right, how='inner', 
                on=None, left_on=None, right_on=None, 
                left_index=False, right_index=False, 
                sort=False, suffixes=('_x', '_y'), 
```

- `sort=True` would **sort** the resulting DataFrame by the **key**
- `suffixes` will append the given strings to any **column labels present in both DataFrames** (but not part of the key) so that they can be distinguished in the new DataFrame
    - **Consider adding these columns to the key** if they contain identical values; they will then only appear once in the new DataFrame

```Python
DataFrame.merge(self, right, how='inner', 
                on=None, left_on=None, right_on=None, 
                left_index=False, right_index=False, 
                sort=False, suffixes=('_x', '_y'), 
                copy=True, indicator=False, validate=None)
```    
- It's less likely that you will want to modify these keyword arguments, but:
    - `copy=False` could be used to help **reduce memory usage**
    - `indicator=True` adds a column giving information about the **source of each row** in relation to the join
    - `validate` allows checking of whether the merge is of a **specified type**, such as one-to-many, many-to-one
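
A short sketch of `validate` in use, with made-up DataFrames in which the right table deliberately repeats a key. pandas raises a `MergeError` (available from `pandas.errors`) when the merge is not of the type we asked it to check for:

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical data: key 'a' is duplicated in the right table
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'a'], 'y': [3, 4]})

try:
    left.merge(right, on='key', validate='one_to_one')
    outcome = 'merge succeeded'
except MergeError:
    # Raised because the keys are not unique in the right table
    outcome = 'not a one-to-one merge'

print(outcome)  # not a one-to-one merge
```

Checking the relationship in this way can catch unexpected duplicate keys before they silently multiply rows in the result.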

- As always, remember that you can and should refer to the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) if:
    - You have an **unusual scenario** to deal with or your **output is not as expected**
    - You regularly use a given method - the **optional parameters** often provide a quick way to **carry out common further processing tasks**

<img src="img/jupyter.png" width="200">

Now open the following workbook: `processing-pandas-workbook.ipynb`

### Data Preparation

Data **pre-processing** or **cleaning** is often required to get our raw data into a more usable state:

- **Removal** of data not required for our task
- **Conversion** of values to an appropriate data type
- Checking for **missing values** and **fixing errors**

#### Removal of superfluous data

We can use methods and syntax previously seen to reduce the number of rows and columns in our dataset.

In [55]:
df_data = pd.read_csv('data/drugs.csv', index_col='NDC Number')
df_data.head(1)

Unnamed: 0_level_0,Manufacturer Name,Drug Product Description,Date Introduced to Market,WAC at Introduction,Marketing/Pricing Plan Description,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Date,Acquisition Price,Acquisition Price Non-Public Indicator,Acquisition Price Comment,General Comments,Supporting Documents
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
47335093640,SUN PHARMACEUTICALS,"Leuprolide Acetate Injection 1Mg/0.2Ml, 2.8Ml",2019-03-01,705.67,,1.0,,,,,,,,ESTIMATED_PATIENTS: unknown to Sun; MARKETING_...,


In [56]:
cols_to_drop = ['Date Introduced to Market', 'Acquisition Date',
                'Acquisition Price', 'Marketing/Pricing Plan Description',
                'Acquisition Price Comment', 'General Comments',
                'Supporting Documents']

df_cols = df_data.drop(cols_to_drop, axis=1)
df_cols.head(1)

Unnamed: 0_level_0,Manufacturer Name,Drug Product Description,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
47335093640,SUN PHARMACEUTICALS,"Leuprolide Acetate Injection 1Mg/0.2Ml, 2.8Ml",705.67,1.0,,,,


- The DataFrame `.drop()` method allows us to drop any columns (`axis=1`) which are not required

In [57]:
df_sub = df_cols[df_cols['Manufacturer Name']!='Kyowa Kirin, Inc.'].copy()
display(df_cols.shape)
df_sub.shape

(206, 8)

(204, 8)

- We can remove rows which have particular values in a given column
- Using `.copy()` ensures that `df_sub` is a distinct object in memory and that subsequent changes to it will not affect the original DataFrame

#### Making data more usable  

We may encounter datasets where particular states or values are represented in a way which is **not suitable or optimal for analysis** using our chosen tools (such as pandas and Python), for example:

- Boolean values are **represented differently** (`Yes` | `No` or `1` | `0`)
- Percentages have **inconsistent formatting** (`0.42` or `42%`)
- Ambiguous dates have been **misinterpreted** (`dd-mm-yy` or `mm-dd-yy`)
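
As a rough sketch of how each of these might be tidied, assuming a made-up `raw` DataFrame (its column names and values are hypothetical, and the dates are assumed to be in `dd-mm-yy` order):

```python
import pandas as pd

# Hypothetical raw data showing the three problems listed above
raw = pd.DataFrame({
    'approved': ['Yes', 'No', 'Yes'],
    'share': ['42%', '7%', '100%'],
    'date': ['03-04-19', '25-12-19', '01-02-19'],
})

tidy = pd.DataFrame({
    # Map the text labels to real Boolean values
    'approved': raw['approved'].map({'Yes': True, 'No': False}),
    # Strip the % sign and convert to a fraction
    'share': raw['share'].str.rstrip('%').astype(float) / 100,
    # State the date format explicitly so that day and month cannot be swapped
    'date': pd.to_datetime(raw['date'], format='%d-%m-%y'),
})

print(tidy.dtypes)
```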

After examining the dataset we notice that all of the `Indicator` columns contain values which are either `1.0` or `NaN`; we decide that replacing the `NaN` values with zeros will help with our analysis:

In [58]:
df_sub.columns

Index(['Manufacturer Name', 'Drug Product Description', 'WAC at Introduction',
       'Marketing/Pricing Plan Non-Public Indicator',
       'Estimated Number of Patients', 'Breakthrough Therapy Indicator',
       'Priority Review Indicator', 'Acquisition Price Non-Public Indicator'],
      dtype='object')

In [59]:
indicator_columns = df_sub.columns[df_sub.columns.str.contains("Indicator")]
df_sub[indicator_columns] = df_sub[indicator_columns].fillna(0).astype(int)
df_sub.head(2)

Unnamed: 0_level_0,Manufacturer Name,Drug Product Description,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
47335093640,SUN PHARMACEUTICALS,"Leuprolide Acetate Injection 1Mg/0.2Ml, 2.8Ml",705.67,1,,0,0,0
47335023683,SUN PHARMACEUTICALS,Ambrisentan 5 mg Tabs 30ct,7500.0,0,0.0,0,0,0


- We used the `.columns` attribute to access the column labels, and the `.str.contains()` method to identify those labels which contain `'Indicator'`
- The `.fillna()` method replaced the `NaN` values with zeros in the identified `indicator_columns`
- `.astype(int)` converts the values from floats (`1.0`) to integers (`1`), which makes our DataFrame more readable and less ambiguous

We also notice that in the `Estimated Number of Patients` column we see both `NaN` values and `0.0` values:

In [60]:
display(df_sub['Estimated Number of Patients'].isna().sum())
df_sub[df_sub['Estimated Number of Patients'] == 0].shape[0]

73

25

We determine that the `0.0` values should in fact be `NaN` values, because we assume they must be missing (rather than there being an expectation that `Estimated Number of Patients` will actually be zero):

In [61]:
df_sub['Estimated Number of Patients'] = df_sub['Estimated Number of Patients'].replace(0, np.nan)
df_clean = df_sub.copy()
df_clean.head(3)

Unnamed: 0_level_0,Manufacturer Name,Drug Product Description,WAC at Introduction,Marketing/Pricing Plan Non-Public Indicator,Estimated Number of Patients,Breakthrough Therapy Indicator,Priority Review Indicator,Acquisition Price Non-Public Indicator
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
47335093640,SUN PHARMACEUTICALS,"Leuprolide Acetate Injection 1Mg/0.2Ml, 2.8Ml",705.67,1,,0,0,0
47335023683,SUN PHARMACEUTICALS,Ambrisentan 5 mg Tabs 30ct,7500.0,0,,0,0,0
47335023783,SUN PHARMACEUTICALS,Ambrisentan 10 mg Tabs 30ct,7500.0,0,,0,0,0


- We used the `.replace()` method to replace zeros with `NaN` values, which can be created using `np.nan`
    - `np` is the alias we used when importing `numpy` earlier
    - the `.nan` attribute defines `NaN` values
- We assigned a copy of the cleaned DataFrame to `df_clean`

<img src="img/jupyter.png" width="200">

Now open the following workbook: `processing-pandas-workbook.ipynb`

#### Using the pandas `.apply()` method

The `.apply()` method allows us to **apply our own functions** to our data

- Typically this is used to create a new column or update an existing one with the results of calling the function with an existing column of values

In [74]:
from dataframes import dimensions
dimensions

Unnamed: 0,length (cm),width (cm)
0,500,450
1,220,250
2,150,800


We'd like to have the dimensions in metres rather than centimetres. Here's a function we can apply:

In [75]:
def cm_to_m(cm):    
    return cm / 100

In [76]:
dimensions[['length (m)', 'width (m)']] = dimensions[['length (cm)', 'width (cm)']].apply(cm_to_m)
dimensions

Unnamed: 0,length (cm),width (cm),length (m),width (m)
0,500,450,5.0,4.5
1,220,250,2.2,2.5
2,150,800,1.5,8.0


- We have applied the function to two columns, assigning the results to two new columns

We can also use `.apply()` with a function which uses multiple values from each row of a DataFrame:

In [77]:
def area(row):
    return row['length (m)'] * row['width (m)']

In [78]:
dimensions['area (m2)'] = dimensions.apply(area, axis=1)
dimensions

Unnamed: 0,length (cm),width (cm),length (m),width (m),area (m2)
0,500,450,5.0,4.5,22.5
1,220,250,2.2,2.5,5.5
2,150,800,1.5,8.0,12.0


- Notice that here the function relies on those columns being present in the DataFrame
- `axis=1` is required so that the function is called once for each row (the default, `axis=0`, would call it once for each column)
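
For simple arithmetic like this, the same result can usually be obtained without `.apply()` by operating on whole columns. A sketch, assuming a made-up `rooms` DataFrame which mirrors the metre columns above:

```python
import pandas as pd

# Hypothetical copy of the metre columns used above
rooms = pd.DataFrame({'length (m)': [5.0, 2.2, 1.5],
                      'width (m)': [4.5, 2.5, 8.0]})

def area(row):
    return row['length (m)'] * row['width (m)']

by_row = rooms.apply(area, axis=1)                     # one function call per row
vectorised = rooms['length (m)'] * rooms['width (m)']  # whole columns at once

print(by_row.tolist() == vectorised.tolist())  # True
```

The column-wise form tends to be faster on large datasets, while `.apply()` with `axis=1` remains useful when the logic for each row is too involved to express as column arithmetic.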

In our drugs example, we notice that we have several instances where multiple entries in `Drug Product Description` refer to the same drug, but with differing dosage levels. We create a function which returns only the first word:

In [79]:
def first_word(description):
    
    return description.split(' ')[0] if ' ' in description else description

- This function has a single parameter `description` and needs to be called with a string
- The `.split()` method will split the string wherever there is a space (`' '`) and then the first element (`[0]`) in the resulting list will be accessed
    - In the absence of a space, `description` will be returned

In [80]:
df_clean['Short Description'] = df_clean['Drug Product Description'].apply(first_word)

- We then used the `.apply()` method to apply our `first_word` function to `Drug Product Description`, creating the new column `Short Description`
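
As an aside, a similar result can be sketched with the built-in `.str` methods instead of a custom function. The `descriptions` Series below is made up for illustration (the single-word entry `'Aspirin'` is hypothetical):

```python
import pandas as pd

# Hypothetical descriptions, including one with no spaces
descriptions = pd.Series(['Ambrisentan 5 mg Tabs 30ct',
                          'Ambrisentan 10 mg Tabs 30ct',
                          'Aspirin'])

def first_word(description):
    return description.split(' ')[0] if ' ' in description else description

with_apply = descriptions.apply(first_word)
with_str = descriptions.str.split(' ').str[0]  # split each value, keep element 0

print(with_str.tolist())  # ['Ambrisentan', 'Ambrisentan', 'Aspirin']
```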

#### Using the `.nunique()` method

The `.nunique()` method allows us to identify the **number of unique values** in a Series:

In [81]:
df_clean['Drug Product Description'].nunique()

196

In [82]:
df_clean['Short Description'].nunique()

87

In the example, we can see that our new `Short Description` column has fewer than half the number of unique values.

### Data Grouping and Aggregation

- We often need to calculate metrics for **subsets** (or groups) of our data
- A dataset can be **split** into **groups** of rows with common values in a given column
- **Calculations** can be **applied** to all groups simultaneously
- The **results** of these calculations can then be **combined** back together

![split-apply-combine](img/split-apply-combine.png)
*Source: [Github](https://camo.githubusercontent.com/60a1e7e95eaef8f9a99f43335368915eafedda3e/687474703a2f2f7777772e686f66726f652e6e65742f737461743537392f736c696465732f73706c69742d6170706c792d636f6d62696e652e706e67)*

#### Using the pandas `.groupby()` DataFrame method

In [63]:
sac = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'c', 'c'], 'y': [2, 4, 0, 5, 5, 10]})
sac

Unnamed: 0,x,y
0,a,2
1,a,4
2,b,0
3,b,5
4,c,5
5,c,10


In [64]:
sac.groupby('x')[['y']].mean()

Unnamed: 0_level_0,y
x,Unnamed: 1_level_1
a,3.0
b,2.5
c,7.5


*with the `sac` DataFrame, `groupby` column `x` and calculate for column(s) `['y']` the `mean` of each group*

- In this particular example, the result would be the same with the omission of `[['y']]`; **if no columns are specified, the function is applied to all of the other columns** in the DataFrame (i.e. every column except the one used for grouping)

- Using **[single square brackets]** is possible when only specifying a **single column** to perform the operation on, but will result in a **Series** rather than a DataFrame being returned
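
The difference between the two forms can be checked directly (a small sketch which rebuilds the same `sac` DataFrame so that it runs on its own):

```python
import pandas as pd

sac = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'c', 'c'],
                    'y': [2, 4, 0, 5, 5, 10]})

as_frame = sac.groupby('x')[['y']].mean()  # double brackets: list of columns
as_series = sac.groupby('x')['y'].mean()   # single brackets: one column

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series
```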

You're unlikely to need to use the following code snippets in isolation, but let's examine each to help clarify **Split - Apply - Combine**:

In [65]:
gb = sac.groupby('x')
display(gb)
len(gb)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe1208e4b10>

3

- the `.groupby()` method created a **groupby object**, which has a length of `3`, i.e. the number of groups it holds (`a`, `b`, and `c`)

In [66]:
gb.get_group('a')

Unnamed: 0,x,y
0,a,2
1,a,4


- The `get_group()` method of a groupby object allows us to access each individual group which has been created

In [67]:
gb.get_group('a').mean(numeric_only=True)  # skip the non-numeric column x

y    3.0
dtype: float64

- The original `.groupby()` statement applies an operation such as the `.mean()` method used above to each group and **combines the results** into a new DataFrame
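
To tie the three stages together, here is a hand-written sketch of what `.groupby('x')['y'].mean()` does for us. This is only an illustration of the idea, not how pandas implements it internally:

```python
import pandas as pd

sac = pd.DataFrame({'x': ['a', 'a', 'b', 'b', 'c', 'c'],
                    'y': [2, 4, 0, 5, 5, 10]})

# Split: one sub-DataFrame for each unique value in x
groups = {key: sac[sac['x'] == key] for key in sac['x'].unique()}

# Apply: calculate the mean of y within each group
means = {key: group['y'].mean() for key, group in groups.items()}

# Combine: collect the results into a single Series
manual = pd.Series(means, name='y')

print(manual.tolist())  # [3.0, 2.5, 7.5]
```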

Another example showing a different method (`.count()`) being applied, using a different column to group by:

In [68]:
sac.groupby('y')[['x']].count()

Unnamed: 0_level_0,x
y,Unnamed: 1_level_1
0,1
2,1
4,1
5,2
10,1


- The **row index labels** show the **unique values** in column `y` of the original DataFrame, by which the values for column `x` have been grouped

Here's an example which applies the `.sum()` method to a subset of columns, with rows grouped by `Manufacturer Name`:

In [69]:
indicators = df_clean.groupby('Manufacturer Name')\
            [['Priority Review Indicator', 
              'Breakthrough Therapy Indicator',
              'Marketing/Pricing Plan Non-Public Indicator', 
              'Acquisition Price Non-Public Indicator']]\
            .sum()

indicators.tail(3)

Unnamed: 0_level_0,Priority Review Indicator,Breakthrough Therapy Indicator,Marketing/Pricing Plan Non-Public Indicator,Acquisition Price Non-Public Indicator
Manufacturer Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"Valeant Pharmaceuticals North America, LLC",0,0,1,0
ViiV Healthcare,0,0,1,0
Zydus Pharmaceuticals (USA) Inc.,0,0,8,0


In [70]:
indicators.columns = ['Priority', 'Breakthrough', 'Marketing', 'Acquisition']
ind_total = indicators.sort_values(by=['Priority'], ascending=False)
ind_total.head(5)

Unnamed: 0_level_0,Priority,Breakthrough,Marketing,Acquisition
Manufacturer Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AveXis,22,22,0,0
Janssen,7,7,0,0
Teva Pharmaceuticals USA,6,0,29,0
Karyopharm Therapeutics Inc.,4,0,4,0
"Paratek Pharmaceuticals, Inc.",4,0,4,0


- Here we have updated the column labels and used the `sort_values()` method to improve readability

#### Using the pandas `.agg()` method

In [71]:
ind_total.agg(['sum', 'mean'])

Unnamed: 0,Priority,Breakthrough,Marketing,Acquisition
sum,64.0,42.0,132.0,9.0
mean,1.391304,0.913043,2.869565,0.195652


- Using `.agg()` on a DataFrame applies each of the **functions** in the **[list]** to every **column** (the `axis` parameter has a default value of `0`)

In [72]:
ind_total.agg(['sum'], axis=1).sort_values(by='sum', ascending=False)\
         .rename(columns={'sum':'Total Indicators'}).head(3)

Unnamed: 0,Total Indicators
AveXis,44
Teva Pharmaceuticals USA,35
Par Pharmaceutical,15


- `axis=1` applies the function(s) to each row
- The **function name** is used for the resulting **column label** by default; here we used the `.rename()` method to update it

#### Combining `.groupby()` with `.agg()`

In [73]:
manu_agg = df_clean.groupby('Manufacturer Name')\
                   .agg(entries=('Manufacturer Name', 'size'), \
                        patient_estimates=('Estimated Number of Patients', 'count'))

manu_agg['missing_estimates'] = manu_agg['entries'] - manu_agg['patient_estimates']

manu_agg.sort_values(by='missing_estimates', ascending=False).head(3)

Unnamed: 0_level_0,entries,patient_estimates,missing_estimates
Manufacturer Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
SUN PHARMACEUTICALS,10,0,10
Zydus Pharmaceuticals (USA) Inc.,8,0,8
"EMD Serono, Inc.",8,0,8


- Here we have used `.agg()` to **apply multiple functions** to a `groupby()` object
- Notice how each keyword argument passed to `.agg()` is constructed as follows:
```python
    result_column_name=('source_column', 'function')
```

- `'size'` returns the **number of values** in the given column, **including `NaN` values** (as such, the choice of column on which to apply it is unimportant, since all columns in a given DataFrame will have the same number)
- `'count'` **excludes `NaN` values**, so here the difference between the columns in `manu_agg` tells us how many entries do not have `patient_estimates` 
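
The difference between `'size'` and `'count'` is easier to see on a tiny example. A sketch, assuming a made-up `toy` DataFrame with hypothetical `maker` and `patients` columns:

```python
import numpy as np
import pandas as pd

# Hypothetical data: maker A has one missing estimate, maker B has only a missing one
toy = pd.DataFrame({'maker': ['A', 'A', 'B'],
                    'patients': [100.0, np.nan, np.nan]})

summary = toy.groupby('maker').agg(entries=('patients', 'size'),     # counts NaN too
                                   estimates=('patients', 'count'))  # ignores NaN

print(summary['entries'].tolist())    # [2, 1]
print(summary['estimates'].tolist())  # [1, 0]
```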

<img src="img/jupyter.png" width="200">

Now open the following workbook: `processing-pandas-workbook.ipynb`

#### Identifying and fixing unusual data issues

At some point we may encounter **unexpected results** in the output of our code

- in practice these **may only be identified after further work** with the dataset
- we need to **identify** and **isolate** potential causes of the issue
- we may need to do some **manual updating** to solve the problem
- if the issue is likely to recur, consider discussing with the author of the data source


#### Practical example

When creating this notebook, at some point we noticed that some entries in `Short Description` had not been shortened as expected:

In [84]:
df_clean[df_clean['Short Description'].str.contains('Cinacalcet')][
                 ['Drug Product Description', 'Short Description']]

Unnamed: 0_level_0,Drug Product Description,Short Description
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1
47335037983,Cinacalcet HCL Oral Tablet 30MG,Cinacalcet
47335038083,Cinacalcet HCL Oral Tablet 60MG,Cinacalcet
47335060083,Cinacalcet HCL Oral Tablet 90MG,Cinacalcet
378619793,"Cinacalcet Hydrochloride Tablets, 30mg, 30s",Cinacalcet
378619693,"Cinacalcet Hydrochloride Tablets, 60mg, 30s",Cinacalcet
378619593,"Cinacalcet Hydrochloride Tablets, 90mg, 30s",Cinacalcet
67877050330,Cinacalcet 30mg 30 Tabs,Cinacalcet 30mg 30 Tabs
67877050430,Cinacalcet 60mg 30 Tabs,Cinacalcet 60mg 30 Tabs
67877050530,Cinacalcet 90mg 30 Tabs,Cinacalcet 90mg 30 Tabs


In [85]:
df_clean.loc[67877050530, 'Drug Product Description']

'Cinacalcet\xa090mg\xa030\xa0Tabs'

- Having checked that the code for the `first_word()` function we used earlier looks correct, we used `.loc[]` to inspect a specific value that was not being processed as expected

- In place of ordinary spaces, the value contains instances of `\xa0` (a non-breaking space), which looks identical to a space when the DataFrame is displayed
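
The following sketch shows one way to expose and locate such characters. The helper `first_word_sketch()` is a hypothetical stand-in that splits on an ordinary space; it is an assumption for illustration and not necessarily how `first_word()` was written.

```python
import pandas as pd

raw = 'Cinacalcet\xa090mg\xa030\xa0Tabs'

print(raw)        # displays as if it contained ordinary spaces
print(repr(raw))  # repr() exposes the \xa0 escapes

# Hypothetical helper: splits on an ordinary space only
def first_word_sketch(text):
    return text.split(' ')[0]

# No ordinary space is found, so the whole string comes back unchanged
unchanged = first_word_sketch(raw)

# Flag every row containing a non-breaking space
descriptions = pd.Series(['Cinacalcet HCL Oral Tablet 30MG', raw])
affected = descriptions.str.contains('\xa0', regex=False)
print(affected)
```

A boolean mask like `affected` can be used to isolate the problem rows before deciding how to fix them.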

**Finding help with unusual issues**

Turning to a [Google search](https://www.google.com/search?q=%5Cxa0) for help, we can find more information:

- A very helpful [Stack Overflow](https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string) page with some possible solutions
- Some understanding of what kinds of [characters](https://terpconnect.umd.edu/~zben/Web/CharSet/htmlchars.html) may cause such problems

The top answer on the **Stack Overflow** page is in fact all we need:

```python
string = string.replace(u'\xa0', u' ')
```

In [86]:
df_clean['Drug Product Description'] = df_clean['Drug Product Description'].str.replace(u'\xa0', u' ')
df_clean['Short Description'] = df_clean['Drug Product Description'].apply(first_word)
df_clean.loc[67877050530, 'Short Description']

'Cinacalcet'

We have fixed our problem by:

- **Checking** our code first
- **Searching** for help online
- **Adapting** some code we found

*We have learned that sometimes issues can occur due to the encoding of unusual characters; next time we will know that this can be an issue and how we might go about resolving it.*
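
As a hedged sketch of the same fix on a standalone Series, the example below replaces the non-breaking spaces and then confirms that none remain. The two descriptions are taken from the table above; splitting on a space to get the first word is an assumption standing in for `first_word()`.

```python
import pandas as pd

descriptions = pd.Series(
    ['Cinacalcet\xa090mg\xa030\xa0Tabs', 'Cinacalcet HCL Oral Tablet 30MG'],
    index=[67877050530, 47335037983],
)

# regex=False treats the pattern as a literal string rather than a regex
cleaned = descriptions.str.replace('\xa0', ' ', regex=False)

# Confirm that no non-breaking spaces were missed
remaining = cleaned.str.contains('\xa0', regex=False).sum()

# Stand-in for first_word(): take the text before the first ordinary space
short = cleaned.str.split(' ').str[0]

print(remaining)
print(short)
```

Checking `remaining` after the replacement is a cheap way to verify that the fix applied to every affected row and not just the one we inspected.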

#### The importance of context and domain knowledge

When using data from third parties, always keep in mind that the author may have produced the dataset for a purpose different from the one you have in mind.

- Before working with a dataset, do what you can to understand *why* it was created and *how* the data was collected
- When working with a dataset, be alert for **unusual patterns**, **inconsistencies**, and **missing data**



#### Practical example

Consider the list of results and values for the entries with a `Short Description` of `Cinacalcet` (only the last 5 are shown below):

In [87]:
df_clean[df_clean['Short Description'] == 'Cinacalcet'].tail()\
        [['Short Description', 'Manufacturer Name', 'Drug Product Description',
          'WAC at Introduction', 'Estimated Number of Patients']]

Unnamed: 0_level_0,Short Description,Manufacturer Name,Drug Product Description,WAC at Introduction,Estimated Number of Patients
NDC Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
378619693,Cinacalcet,Mylan Pharmaceuticals Inc,"Cinacalcet Hydrochloride Tablets, 60mg, 30s",1371.39,
378619593,Cinacalcet,Mylan Pharmaceuticals Inc,"Cinacalcet Hydrochloride Tablets, 90mg, 30s",2057.09,
67877050330,Cinacalcet,"Ascend Laboratories, LLC",Cinacalcet 30mg 30 Tabs,685.5,468000.0
67877050430,Cinacalcet,"Ascend Laboratories, LLC",Cinacalcet 60mg 30 Tabs,1371.25,468000.0
67877050530,Cinacalcet,"Ascend Laboratories, LLC",Cinacalcet 90mg 30 Tabs,2057.0,468000.0


Notice that we have entries with:
- A different `Manufacturer Name` and a similar `Drug Product Description`  
... *Can they be treated as the same drug?* 
    
- Similar but not identical values for `WAC at Introduction`  
... *Should these be considered to be equal?*  
    
- Matching values for `Estimated Number of Patients` for the same drug at different doses  
... *Is the total the sum of the values or just one of them?*      

We need to understand the **underlying data**, and the **context** in which it has been collected.
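
Code cannot answer these questions, but it can show how much the result depends on the interpretation chosen. The sketch below reuses the values from the table above; the 0.1% tolerance is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# WAC values copied from the table above, by dose
wac = pd.DataFrame(
    {'Mylan': [1371.39, 2057.09], 'Ascend': [1371.25, 2057.0]},
    index=['60mg', '90mg'],
)

# Strict equality versus equality within a relative tolerance (assumed 0.1%)
exact_match = wac['Mylan'] == wac['Ascend']
close_match = np.isclose(wac['Mylan'], wac['Ascend'], rtol=0.001)

# Patient estimates for the three Ascend doses
ascend_patients = pd.Series([468000.0, 468000.0, 468000.0])
total_if_summed = ascend_patients.sum()  # each dose has its own patients
total_if_shared = ascend_patients.max()  # one estimate repeated per dose

print(exact_match.tolist(), close_match.tolist())
print(total_if_summed, total_if_shared)
```

The two readings of the patient estimates differ by a factor of three (1,404,000 versus 468,000), which is why the choice should come from the documentation or the data provider and not from a guess.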