# Pandas introduction

# 0. Introduction

## 0.1. Why Pandas?

Pandas is a Python package that has become a standard for data science. It is very relevant because:
+ It allows the use of data structures similar to SQL tables or spreadsheets.
+ It includes several functions that make it easy to work with data sets.
+ It can also create plots directly from the data.

Pandas documentation:
+ [Documentation](https://pandas.pydata.org/docs/)
+ [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

Some equivalences between Pandas and SQL can be found in the [Pandas documentation](https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html).

## 0.2. Import the package

<div class="alert alert-warning">
    First, install the package (for example, from the Anaconda Prompt):
    <br>
    $ pip install pandas
</div>

When you import a package, you can import the whole package.

In [None]:
import pandas
print(pandas.__version__)

Or import some specific functions from the package.

In [None]:
from pandas import DataFrame

It is also possible to give the package an alias so that it is easier to invoke. It is standard to import the Pandas package as `pd`.

In [None]:
import pandas as pd
print(pd.__version__)

## 0.3. Data structures in Pandas

Pandas has two main data structures: Series and DataFrames.

# 1. Series

A Pandas *series* is a labelled vector. It can be thought of as a one-column SQL table, and it can be built from a *list*.

In [None]:
series1 = pd.Series(data=['a', 'b', 'c', 'd'])

The first column shows the index, and the second one shows the elements. The data type (*dtype*) is also displayed. By default, a list of strings is usually stored with the generic `object` type. This is not a problem in itself, but it can cause trouble with functions that check whether the data is of string type.

In [None]:
series1

It is possible to request the string type explicitly by using the `dtype` parameter.

In [None]:
series2 = pd.Series(data=['a', 'b', 'c', 'd'], dtype=pd.StringDtype())
series2

Furthermore, it can be changed after the creation with the `astype` method.

In [None]:
series1 = series1.astype('string')
series1

Specific elements of the *series* can be accessed in the same way as with lists.

In [None]:
series1[1:3]
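Because a series is a labelled vector, its elements can be reached by position or by label. The following is a minimal sketch with made-up values and a hypothetical index (`scores`, `w` to `z` are not part of the course data); it uses `iloc` for positions and `loc` for labels.

```python
import pandas as pd

# A series with explicit (hypothetical) index labels instead of 0, 1, 2, ...
scores = pd.Series(data=[10, 20, 30, 40], index=['w', 'x', 'y', 'z'])

by_position = scores.iloc[1:3]  # positions 1 and 2; the end position is excluded
by_label = scores.loc['x':'y']  # labels 'x' to 'y'; the end label is included

print(by_position.tolist())
print(by_label.tolist())
```

Note that slicing by label includes the last label, whereas slicing by position excludes the last position, as with lists.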

You can transform a *series* into a list by using the `tolist` method.

In [None]:
series1 = series1.tolist()
series1

## 1.1. Methods

There are some basic methods that can be applied to *series*:
+ `head` - display the first elements (default 5)
+ `sum` - sum of the *series* values
+ `mean` - average of the *series* values
+ `median` - median of the *series* values
+ `count` - number of non-NA elements
+ `unique` - get unique values
+ `sort_values` - sort the *series* values

In [None]:
series3 = pd.Series(data=[0, 10, 1, 2, 2, 5, 3, 4, 4, 8, 1, 2, 4])
series3

In [None]:
series3.head(3)

In [None]:
print(series3.sum())
print(series3.mean())
print(series3.median())
print(series3.count())
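As a small illustration of how these methods treat missing values, the sketch below uses an invented series (`values`) that contains one missing element. It assumes the default behaviour of `head`, `count` and `sum`.

```python
import pandas as pd

# Invented values; None becomes a missing value (NaN)
values = pd.Series([1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0])

first_rows = values.head()     # no argument: the first 5 elements
n_total = len(values)          # all elements, missing ones included
n_valid = values.count()       # only the non-missing elements
total = values.sum()           # missing values are skipped

print(len(first_rows), n_total, n_valid, total)
```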

<div class="alert alert-warning">
    When the <i>unique</i> method is applied to this <i>series</i>, the result is a <i>numpy ndarray</i>. It behaves much like a <i>series</i>, but it has no index and does not offer the <i>series</i> methods.
</div>

In [None]:
series3.unique()

You can convert it back into a *series* without any problem.

In [None]:
pd.Series(series3.unique())

By default, when `sort_values` is applied the values are ordered in ascending order. It can be changed by using the `ascending` parameter.

In [None]:
series3.sort_values()

In [None]:
series3.sort_values(ascending=False)

<div class="alert alert-info">
    It is also possible to apply several methods in a row by chaining them.
</div>

In [None]:
pd.Series(series3.unique()).sort_values()
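Long chains can be easier to read when they are wrapped in parentheses with one method per line. This is only a formatting suggestion; the sketch reuses the same values as `series3` under the hypothetical name `values`.

```python
import pandas as pd

values = pd.Series([0, 10, 1, 2, 2, 5, 3, 4, 4, 8, 1, 2, 4])

# Each step receives the result of the previous one
top3 = (
    pd.Series(values.unique())      # unique values as a series again
    .sort_values(ascending=False)   # largest first
    .head(3)                        # keep the first three
)

print(top3.tolist())
```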

## 1.2. Plots

A plot can be generated by using the `plot` method (default line plot). 

In [None]:
series3.plot()

A specific kind of plot can be requested with the `kind` parameter. The available values are:
+ `line` - line plot
+ `bar` - vertical bar plot
+ `barh` - horizontal bar plot
+ `hist` - histogram
+ `box` - boxplot
+ `kde` - Kernel Density Estimation plot
+ `density` - same as `kde`
+ `area` - area plot
+ `pie` - pie plot

In [None]:
series3.plot(kind='box')

A specific kind of plot can also be invoked directly, although the names do not always match the `kind` values. For instance, a histogram is available as `plot.hist` or as `hist`.

In [None]:
series3.plot.hist()

<div class="alert alert-warning">
    The <i>hist</i> method does not accept exactly the same arguments as <i>plot.hist</i>, and its defaults differ (for instance, it draws a grid).
</div>

In [None]:
series3.hist()

# 2. DataFrame

Pandas *DataFrame* is a two-dimensional data structure. Each column can contain a different data type, just like a SQL table (without having to declare any specific data type) or a spreadsheet.

It is standard to name a *DataFrame* `df`.

In [None]:
df = pd.DataFrame({'column_1': [1, 2, 3, 4], 'column_2': ['a', 'b', 'c', 'd']})

The visualization of a *DataFrame* is very clear, using a table structure. As with the series, an index is also generated.

In [None]:
df

## 2.1. Import files as DataFrame

A csv/tsv file can be easily imported with `read_csv`. By default it uses the comma as the delimiter character; to indicate a different one, use the `sep` parameter. The character encoding of the file can be set with the `encoding` parameter.

<div class="alert alert-info">
    The two dots (..) are used to navigate to the parent directory.
</div>

In [None]:
df = pd.read_csv('../data/publications.txt', sep='\t', encoding='utf-8')
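To try `read_csv` without the course file, the following sketch reads a tab-separated text held in memory. The content of `tsv_text` is invented, and `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Invented tab-separated content: one header line and two rows
tsv_text = "paper_title\tpub_year\tn_cits\nPaper A\t2019\t3\nPaper B\t2020\t12\n"

# io.StringIO lets read_csv treat the text as if it were a file
df_demo = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_demo.shape)
print(df_demo.columns.tolist())
```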

Displaying the *DataFrame* previews the data and also indicates its size (30021 rows × 7 columns). The size can also be obtained with the `shape` attribute.

In [None]:
df

In [None]:
df.shape

The columns can be renamed with the `rename` method.

In [None]:
df = df.rename(columns={'authors':'pub_authors'})
df

Check the data types with which the file has been imported in order to fix possible problems.

<div class="alert alert-warning">
    Note that the presence of null values may alter some types.
</div>

In [None]:
df.dtypes
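The effect of null values on the data types can be seen in a small sketch. The data below is invented: a column of integers with one empty cell is imported as floating point unless a nullable integer type such as `Int64` is requested.

```python
import io
import pandas as pd

# Invented csv content; the second row has no value for n_cits
csv_text = "paper_id,n_cits\n1,5\n2,\n3,7\n"

inferred = pd.read_csv(io.StringIO(csv_text))
nullable = pd.read_csv(io.StringIO(csv_text), dtype={'n_cits': 'Int64'})

print(inferred['n_cits'].dtype)  # the missing value forces a float type
print(nullable['n_cits'].dtype)  # the nullable integer type keeps integers
```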

As with series, you can specify the data type directly when creating the object by using the `dtype` parameter.

In [None]:
df = pd.read_csv('../data/publications.txt', sep='\t', encoding='utf-8', dtype={'authors': 'string', 'journal_title': 'string', 'paper_title': 'string'})
df

Or transform one specific column after the data import process with the `astype` method.

In [None]:
df['abstract'] = df['abstract'].astype('string')

Everything is fine now!

In [None]:
df.dtypes

The `describe` method is used for calculating some statistics.

In [None]:
df.describe()

The `count` method can also be used to get the number of non-NA cells.

In [None]:
df.count()

## 2.2. Data selection and filtering 

### 2.2.1. Selection

<div class="alert alert-info">
    Similar to <b>SQL SELECT</b>.
</div>

There are several ways to select data (you need to specify the exact rows and/or columns that you want to select):
+ `[]` - select columns by name or rows by booleans
+ `iloc[]` - select rows and columns by positions/booleans
+ `loc[]` - select rows and columns by labels/booleans

It is possible to select columns by `[]`.

<div class="alert alert-warning">
    Note that the result is a Pandas <i>series</i>. <b>Selecting a single column always simplifies the object</b>. This can be avoided by passing a <i>list</i>.
</div>

In [None]:
df['paper_title']

In [None]:
df[['paper_title']]
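The difference between the two calls can be checked on a small table. The sketch below uses an invented *DataFrame* named `papers` with the same column names as the course data.

```python
import pandas as pd

papers = pd.DataFrame({'paper_title': ['Paper A', 'Paper B'],
                       'pub_year': [2019, 2020]})

as_series = papers['paper_title']               # a single name returns a series
as_frame = papers[['paper_title']]              # a list returns a DataFrame
reordered = papers[['pub_year', 'paper_title']] # a list also sets the column order

print(type(as_series).__name__, type(as_frame).__name__)
print(reordered.columns.tolist())
```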

To select more than one column you have to use a `list`. This data selection is useful to change the order of the columns.

In [None]:
df[['paper_title', 'pub_year']]

Using `[]` you can select rows by using a *list* of booleans.

In [None]:
df[[True, False, False, False, True]+[False]*30016]

However, selecting rows by position has to be done with `iloc`.

<div class="alert alert-warning">
    (Once again) Note that the result is a Pandas <i>series</i>. This can be avoided by passing a <i>list</i>.
</div>

In [None]:
df.iloc[0]

In [None]:
df.iloc[[0]]

The row and column positions can be specified together, separated by a comma: `[row, column]`.

<div class="alert alert-warning">
    In this case, as only one cell with a <i>string</i> is selected it returns this <i>string</i>.
</div>

In [None]:
df.iloc[2, 5]

In [None]:
df.iloc[[2], [5]]

You can pass a list of positions (a slice such as `0:4` also works).

In [None]:
df.iloc[[0,1,2,3], [5]]

A colon (:) on its own retrieves all the positions along that axis.

In [None]:
df.iloc[[1,2,3], :]

In [None]:
df.iloc[:, [1,2,3]]

By using `loc` the columns are specified by their names and the rows by the index labels.

<div class="alert alert-warning">
    Note that by default the index is a sequence of <i>integers</i>, but it may change (for instance, after filtering or setting a custom index).
</div>

In [None]:
df.loc[[1,2,3], 'paper_title']

In [None]:
df.loc[0, 'paper_title']
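The difference between positions and labels becomes visible when the index is not 0, 1, 2 and so on. The sketch below uses an invented *DataFrame* (`papers`) with a hypothetical index of 10, 20 and 30.

```python
import pandas as pd

papers = pd.DataFrame({'paper_title': ['Paper A', 'Paper B', 'Paper C'],
                       'n_cits': [3, 12, 7]},
                      index=[10, 20, 30])

first_by_position = papers.iloc[0, 0]           # first row, first column
first_by_label = papers.loc[10, 'paper_title']  # row labelled 10

# Filtering keeps the original labels, so positions and labels diverge
filtered = papers[papers['n_cits'] > 5]
print(filtered.index.tolist())
print(filtered.iloc[0, 0])
```

Here `filtered.iloc[0]` is the row labelled 20, while `filtered.loc[0]` would fail because no row has the label 0.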

Both of them (`loc` and `iloc`) support boolean lists.

In [None]:
df.loc[[True, False, False, True]+[False]*30017, [False, False, True, False, False, True, False]]

### 2.2.2. Filtering

<div class="alert alert-info">
    Similar to <b>SQL WHERE</b>.
</div>

These selection options can be combined with more advanced options to filter the data, all of them inside the `[]`:
+ ==, >, >=, <, <=, != - comparison operators
+ `isin` - rows whose values are in a specified *list*
+ `contains` - rows whose values include a string

For instance, the dataset can be filtered to papers with more than 5 citations.

In [None]:
df[df['n_cits'] > 5]
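The expression inside the brackets is itself a series of booleans, one per row, which is then used to select the rows. A minimal sketch with invented data (`papers`) makes this intermediate step visible.

```python
import pandas as pd

papers = pd.DataFrame({'paper_title': ['Paper A', 'Paper B', 'Paper C'],
                       'n_cits': [3, 12, 7]})

mask = papers['n_cits'] > 5   # boolean series: one True/False per row
selected = papers[mask]       # keeps only the rows where the mask is True

print(mask.tolist())
print(selected['paper_title'].tolist())
```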

The result can also be restricted to some specific columns. This can be done in different ways.

In [None]:
df[df['n_cits'] > 5][['paper_title', 'n_cits']]

In [None]:
df.loc[df['n_cits'] > 5,['paper_title', 'n_cits']]

<div class="alert alert-info">
    Note that this is the same as if we modify the table step by step.
</div>

In [None]:
df_f = df[df['n_cits'] > 5]
df_f = df_f[['paper_title', 'n_cits']]
df_f

Rows can be filtered by comparing a column with an exact value.

In [None]:
df[df['journal_title'] == 'Scientometrics']

Or in the opposite direction, excluding rows by an exact value.

In [None]:
df[df['journal_title'] != 'Scientometrics']

If you want to do this comparison using a *list* of values, you have to use the `isin` method.

In [None]:
df[df['journal_title'].isin(['PLOS ONE', 'Nature', 'Science'])]

To exclude a *list* of values you have to use a `~` before the data selection.

In [None]:
df[~ df['journal_title'].isin(['PLOS ONE', 'Nature', 'Science'])]
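The following sketch shows `isin` and its negation side by side on an invented table (`papers`); the list `wanted` is also hypothetical.

```python
import pandas as pd

papers = pd.DataFrame({'journal_title': ['Nature', 'Scientometrics',
                                         'Science', 'PLOS ONE']})
wanted = ['Nature', 'Science']

in_list = papers[papers['journal_title'].isin(wanted)]       # rows in the list
not_in_list = papers[~papers['journal_title'].isin(wanted)]  # ~ inverts the mask

print(in_list['journal_title'].tolist())
print(not_in_list['journal_title'].tolist())
```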

<div class="alert alert-warning">
    Note that before <i>contains</i> you have to specify <i>str</i>.
    Also, <b>this method is case sensitive by default</b>.
</div>

In [None]:
df[df['paper_title'].str.contains('altmetric')]

In [None]:
df[df['paper_title'].str.contains('Altmetric')]

<div class="alert alert-block alert-danger">
    It may raise an error if the column contains NA/NaN values (this depends on the column data type and on the Pandas version).
</div>

In [None]:
df[df['abstract'].str.contains('altmetric')]

Rows with missing values can be treated as non-matching by setting the `na` parameter to `False`.

In [None]:
df[df['abstract'].str.contains('altmetric', na=False)]
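The effect of the `case` and `na` parameters can be checked on a short series. The texts in `abstracts` are invented, and the second element is deliberately missing.

```python
import pandas as pd

abstracts = pd.Series(['Altmetric indicators for journals',
                       None,
                       'A study of citations'])

# Case sensitive by default: 'Altmetric' does not match 'altmetric'
case_sensitive = abstracts.str.contains('altmetric', na=False)

# case=False ignores upper/lower case; na=False marks missing values as False
case_insensitive = abstracts.str.contains('altmetric', case=False, na=False)

print(case_sensitive.tolist())
print(case_insensitive.tolist())
```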

All these methods can be combined and multiple conditions can be indicated at the same time. The logical operators are:
+ & - AND
+ | - OR

For instance, select papers that contain the word altmetric in the title or the word twitter in the abstract.

In [None]:
df[df['paper_title'].str.contains('altmetric', case=False) | df['abstract'].str.contains('twitter', case=False, na=False)]

Or select papers that contain the word altmetric in the title and have more than 10 citations.

<div class="alert alert-warning">
    Do not forget to wrap each comparison in () when you combine conditions: & and | are evaluated before the comparison operators.
</div>

In [None]:
df[df['paper_title'].str.contains('altmetric', case=False) & (df['n_cits'] > 10)]
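The need for parentheses can be illustrated with a small sketch on invented data (`papers`). The last part deliberately omits the parentheses to show that Python then groups the expression differently and raises a `ValueError`.

```python
import pandas as pd

papers = pd.DataFrame({'pub_year': [2018, 2020, 2021, 2022],
                       'n_cits': [15, 2, 30, 11]})

# AND: both conditions must hold
both = papers[(papers['pub_year'] >= 2020) & (papers['n_cits'] > 10)]

# OR: at least one condition must hold
either = papers[(papers['pub_year'] < 2020) | (papers['n_cits'] > 20)]

# Without parentheses, & is evaluated before >= and >, which fails
try:
    papers[papers['pub_year'] >= 2020 & papers['n_cits'] > 10]
    raised = False
except ValueError:
    raised = True

print(both['pub_year'].tolist(), either['pub_year'].tolist(), raised)
```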

## 2.3. Grouping methods

<div class="alert alert-info">
    Similar to <b>SQL GROUP BY</b>.
</div>

First, it is possible to remove duplicated rows by using the `drop_duplicates` method, for instance to get the unique pairs of journal titles and years.

In [None]:
df[['journal_title', 'pub_year']].drop_duplicates()

As there are some NA values, they can be removed with the `dropna` method or with the `dropna` parameter of the `groupby` method. By default, `dropna` removes every row that contains an NA value in any column, but it is possible to restrict the check to specific columns with the `subset` parameter.

In [None]:
df[['journal_title', 'pub_year']].drop_duplicates().dropna()

In [None]:
df[['journal_title', 'pub_year']].drop_duplicates().dropna(subset=['journal_title'])
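The difference between the two `dropna` calls is easier to see on a tiny table. The sketch below uses invented pairs (`pairs`) in which one row lacks the journal and another lacks the year.

```python
import pandas as pd

pairs = pd.DataFrame({'journal_title': ['Scientometrics', None, 'Nature'],
                      'pub_year': [2020, 2021, None]})

# Default: drop every row with an NA value in any column
any_na_removed = pairs.dropna()

# subset: only check the listed columns
journal_na_removed = pairs.dropna(subset=['journal_title'])

print(len(any_na_removed), len(journal_na_removed))
```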

With the `groupby` method you can group the *DataFrame* by one or more columns and then apply some specific methods to each group. On its own, `groupby` only returns a grouped object, not a result table.

In [None]:
df[['pub_year', 'paper_title']].groupby(['pub_year'])

The `count` method returns the number of non-NA values in each group. For instance, the number of papers per year.

In [None]:
df[['pub_year', 'paper_title']].groupby(['pub_year']).count()

After applying `groupby` and an aggregation method, the columns of the resulting *DataFrame* can be renamed with `rename`, and the grouping column can be moved from the index back to a regular column with `reset_index`.

In [None]:
df[['pub_year', 'paper_title']].groupby(['pub_year']).count().rename(columns={'paper_title':'papers'}).reset_index()

Other methods can be used after the grouping. For instance, `sum` and `mean`.

In [None]:
df[['pub_year', 'n_cits']].groupby(['pub_year']).mean().rename(columns={'n_cits':'mean_cits'}).reset_index()

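As the *journal_title* column includes some NA values, the `dropna` parameter is used to exclude them from the groups (`dropna=True` is the default; it is written out here to make it explicit).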

In [None]:
df[['journal_title', 'n_cits']].groupby(['journal_title'], dropna=True).sum().rename(columns={'n_cits':'sum_cits'}).reset_index()
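To see what the `dropna` parameter of `groupby` changes, the sketch below groups an invented table (`papers`) twice, once keeping the rows without a journal title as their own group and once excluding them.

```python
import pandas as pd

papers = pd.DataFrame({'journal_title': ['A', 'A', None, 'B'],
                       'n_cits': [1, 2, 3, 4]})

# dropna=False keeps the rows without a journal title as a separate group
with_na = papers.groupby('journal_title', dropna=False)['n_cits'].sum()

# dropna=True (the default) leaves those rows out
without_na = papers.groupby('journal_title', dropna=True)['n_cits'].sum()

print(len(with_na), len(without_na))
print(without_na.tolist())
```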

The *DataFrame* can be sorted by the values of one column using `sort_values`.

In [None]:
df[['journal_title', 'n_cits']].groupby(['journal_title'], dropna=True).sum().rename(columns={'n_cits':'sum_cits'}).reset_index().sort_values(by='sum_cits', ascending=False)

Another interesting method is `agg`, as it summarises the *DataFrame* by applying one or more aggregation operations.

In [None]:
df[['n_cits', 'n_refs']].agg(['sum', 'min', 'max', 'mean', 'std'])

It is also possible to apply specific operations to each column.

In [None]:
df[['n_cits', 'n_refs']].agg({'n_cits' : ['sum', 'min'], 'n_refs' : ['min', 'max']})

You can combine `agg` with `groupby` to apply multiple operations at the same time.

In [None]:
df[['journal_title', 'n_cits']].groupby(['journal_title'], dropna=True).agg(['count', 'sum', 'min', 'max', 'mean', 'std'])

Furthermore, they can be applied selectively, with different operations for each column.

In [None]:
df[['journal_title', 'n_cits', 'n_refs']].groupby(['journal_title'], dropna=True).agg({'n_cits' : ['sum', 'min', 'count'], 'n_refs' : ['mean', 'max']})

<div class="alert alert-warning">
    Notice that after applying this method it returns a DataFrame with two column name levels.
</div>

In [None]:
df_agg = df[['journal_title', 'n_cits', 'n_refs']].groupby(['journal_title'], dropna=True).agg({'n_cits' : ['sum', 'min', 'count'], 'n_refs' : ['mean', 'max']})
df_agg

The two levels can be flattened into a single one by joining the names of each column with the `map` method.

In [None]:
df_agg.columns = df_agg.columns.map('_'.join)
df_agg
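An alternative that avoids the two column levels altogether is the named aggregation syntax of `agg`, where each output column is given a name together with its source column and operation. This is a different technique from the `map` approach above; the sketch uses invented data (`papers`) and hypothetical output names.

```python
import pandas as pd

papers = pd.DataFrame({'journal_title': ['A', 'A', 'B'],
                       'n_cits': [1, 2, 4],
                       'n_refs': [10, 20, 30]})

# new_column_name=(source_column, operation)
summary = (
    papers.groupby('journal_title')
    .agg(n_cits_sum=('n_cits', 'sum'), n_refs_max=('n_refs', 'max'))
    .reset_index()
)

print(summary.columns.tolist())
```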

## 2.4. Joining

<div class="alert alert-info">
    Similar to <b>SQL JOIN</b>.
</div>

For this section, two subsets are created.

In [None]:
df1 = df[['paper_title', 'pub_year']]
df2 = df[['paper_title', 'journal_title']]

Two *DataFrames* can be joined using the `merge` method. The join method can be specified using the `how` parameter:
+ left - left outer join
+ right - right outer join
+ outer - full outer join
+ inner - inner join

The key column used for joining is specified with the `on` parameter. If this column has a different name in the two *DataFrames* you have to use `left_on` and `right_on` instead.

In [None]:
pd.merge(df1, df2, how='inner', on='paper_title')

As the second *DataFrame* includes some NA values, the rows that contain them can be removed before joining.

In [None]:
pd.merge(df1, df2.dropna(), how='inner', on='paper_title')
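The effect of the `how` parameter, and of `left_on` and `right_on`, can be sketched with two invented tables whose key columns have different names (`paper_title` and the hypothetical `title`).

```python
import pandas as pd

left = pd.DataFrame({'paper_title': ['A', 'B', 'C'],
                     'pub_year': [2019, 2020, 2021]})
right = pd.DataFrame({'title': ['B', 'C', 'D'],
                      'journal_title': ['J1', 'J2', 'J3']})

keys = {'left_on': 'paper_title', 'right_on': 'title'}

inner = pd.merge(left, right, how='inner', **keys)     # only keys present in both
left_join = pd.merge(left, right, how='left', **keys)  # every row of the left table
outer = pd.merge(left, right, how='outer', **keys)     # every row of both tables

print(len(inner), len(left_join), len(outer))
```

Rows without a match in the other table are filled with NA values in the `left` and `outer` results.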

## 2.5. Plots

Two *DataFrames* are created for the examples in this section.

In [None]:
df3 = df[['paper_title', 'n_cits', 'n_refs']].groupby(['paper_title'], dropna=True).sum().rename(columns={'n_cits':'sum_cits', 'n_refs':'sum_refs'}).reset_index().sort_values(by='sum_cits', ascending=False)
df3

In [None]:
df4 = df[['pub_year', 'n_cits', 'paper_title']].groupby(['pub_year']).agg({'n_cits' : ['sum'], 'paper_title' : ['count']}).rename(columns={'paper_title':'papers'}).reset_index()
df4.columns = df4.columns.map('_'.join)
df4

As seen with series, a plot can be generated by using the `plot` method (default line plot). A specific kind of plot can be requested with the `kind` parameter. The available values are:
+ `line` - line plot
+ `bar` - vertical bar plot
+ `barh` - horizontal bar plot
+ `hist` - histogram
+ `box` - boxplot
+ `kde` - Kernel Density Estimation plot
+ `density` - same as `kde`
+ `area` - area plot
+ `pie` - pie plot
+ `scatter` - scatter plot
+ `hexbin` - hexbin plot

The scatter and hexbin plots are specific to *DataFrames*, as you have to indicate which columns to use as *x* and *y*.

In [None]:
df3.plot(kind='scatter', x='sum_cits', y='sum_refs')

In [None]:
df4.plot(kind='barh', x='pub_year_')

A specific kind of plot can also be invoked directly. Not all of them use the same name as the `kind` value; boxplots, for example, are available as `boxplot`.

In [None]:
df3.boxplot()

Setting `showfliers=False` hides the outliers, while `grid=False` removes the grid lines.

In [None]:
df3.boxplot(showfliers=False, grid=False)

## 2.6. Export data

*DataFrames* can be exported as csv/tsv files using the `to_csv` method. The `sep` parameter indicates the delimiter character, and `index=False` prevents the index from being exported.

In [None]:
df3.to_csv('../data/agg_dataframe.tsv', sep='\t', index=False)
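To check what `index=False` does without writing a file, the sketch below exports an invented table (`summary`) to an in-memory buffer and reads it back. `io.StringIO` stands in for the output file.

```python
import io
import pandas as pd

# Invented table with a non-default index (7 and 9)
summary = pd.DataFrame({'paper_title': ['Paper A', 'Paper B'],
                        'sum_cits': [3, 5]},
                       index=[7, 9])

buffer = io.StringIO()
summary.to_csv(buffer, sep='\t', index=False)  # the index is not written
exported_text = buffer.getvalue()

header = exported_text.splitlines()[0]
reloaded = pd.read_csv(io.StringIO(exported_text), sep='\t')

print(header.split('\t'))
print(reloaded.index.tolist())
```

Since the index was not exported, the reloaded table gets a fresh default index.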