## Pandas Tutorial

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Agenda

- What are DataFrames?
- What are Series?
- Different operations in Pandas

In [1]:
## First step is to import pandas

import pandas as pd
import numpy as np

In [2]:
## Playing with Dataframe

df=pd.DataFrame(np.arange(0,20).reshape(5,4),index=['Row1','Row2','Row3','Row4','Row5'],columns=["Column1","Column2","Column3","Coumn4"])

In [3]:
df.head()

Unnamed: 0,Column1,Column2,Column3,Coumn4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [4]:
## Accessing the elements

df.loc['Row1']

Column1    0
Column2    1
Column3    2
Coumn4     3
Name: Row1, dtype: int32

In [5]:
## Check the type

type(df.loc['Row1'])

pandas.core.series.Series

In [6]:
df.iloc[:,:]

Unnamed: 0,Column1,Column2,Column3,Coumn4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [7]:
## Take all rows from Row2 onward and all columns from Column2 onward
df.iloc[1:,1:]


Unnamed: 0,Column2,Column3,Coumn4
Row2,5,6,7
Row3,9,10,11
Row4,13,14,15
Row5,17,18,19
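
To make the difference between position-based and label-based selection concrete, here is a small sketch. It rebuilds a similar 5x4 frame with hypothetical short column names (`C1` to `C4`, not the names used above) and selects the same block both ways. Note that `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the 5x4 example above
demo = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["C1", "C2", "C3", "C4"],
)

by_position = demo.iloc[1:3, 1:3]              # rows 1-2, columns 1-2 (end excluded)
by_label = demo.loc["Row2":"Row3", "C2":"C3"]  # same block (end label included)

print(by_position.equals(by_label))
```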


In [8]:
# Convert the DataFrame into a NumPy array
df.iloc[:].values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

In [9]:
df['Column1'].value_counts()

0     1
4     1
8     1
12    1
16    1
Name: Column1, dtype: int64

In [10]:
df=pd.read_csv('gender_submission.csv')

In [11]:
df.head(n=16)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB


In [13]:
#Pandas describe() is used to view some basic statistical details like percentile,
#mean, std etc. of a data frame or a series of numeric values.

df.describe()

Unnamed: 0,PassengerId,Survived
count,418.0,418.0
mean,1100.5,0.363636
std,120.810458,0.481622
min,892.0,0.0
25%,996.25,0.0
50%,1100.5,0.0
75%,1204.75,1.0
max,1309.0,1.0


In [14]:
#Get the unique category counts
df['Survived'].value_counts()

0    266
1    152
Name: Survived, dtype: int64

In [15]:
df[df['PassengerId']>1300]

Unnamed: 0,PassengerId,Survived
409,1301,1
410,1302,1
411,1303,1
412,1304,1
413,1305,0
414,1306,1
415,1307,0
416,1308,0
417,1309,0


In [16]:
df.corr()

Unnamed: 0,PassengerId,Survived
PassengerId,1.0,-0.023245
Survived,-0.023245,1.0


In [17]:
import numpy as np

In [18]:
lst_data=[[1,2,3],[3,4,np.nan],[5,6,np.nan],[np.nan,np.nan,np.nan]]

In [19]:
df=pd.DataFrame(lst_data)

In [20]:
df.head()

Unnamed: 0,0,1,2
0,1.0,2.0,3.0
1,3.0,4.0,
2,5.0,6.0,
3,,,


In [21]:
## Handling Missing Values

##Drop nan values

df.dropna(axis=0)

Unnamed: 0,0,1,2
0,1.0,2.0,3.0


In [22]:
df.dropna(axis=1)

0
1
2
3
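
By default `dropna` removes a row (or column) as soon as it contains a single missing value, which is why so little survives above. The sketch below, using the same small list-based frame, shows two gentler options: `how="all"` and `thresh`. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3], [3, 4, np.nan], [5, 6, np.nan], [np.nan, np.nan, np.nan]]
)

no_empty_rows = frame.dropna(how="all")          # drop rows where every value is missing
fully_observed = frame.dropna(thresh=3)          # keep rows with at least 3 non-missing values
no_empty_cols = frame.dropna(axis=1, how="all")  # drop columns that are entirely missing

print(no_empty_rows.shape, fully_observed.shape, no_empty_cols.shape)
```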


In [23]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                     columns=['one', 'two', 'three'])

In [24]:
df.head()

Unnamed: 0,one,two,three
a,-1.646659,0.017659,1.040844
c,0.2395,0.022055,-1.808276
e,-1.998557,-0.552188,0.241354
f,0.533755,-1.141296,-2.382672
h,-0.044448,-0.575298,0.934528


In [25]:
df2=df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [26]:
df2

Unnamed: 0,one,two,three
a,-1.646659,0.017659,1.040844
b,,,
c,0.2395,0.022055,-1.808276
d,,,
e,-1.998557,-0.552188,0.241354
f,0.533755,-1.141296,-2.382672
g,,,
h,-0.044448,-0.575298,0.934528


In [27]:
df2.dropna(axis=0)

Unnamed: 0,one,two,three
a,-1.646659,0.017659,1.040844
c,0.2395,0.022055,-1.808276
e,-1.998557,-0.552188,0.241354
f,0.533755,-1.141296,-2.382672
h,-0.044448,-0.575298,0.934528


In [28]:
pd.isna(df2['one'])

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [29]:
df2['one'].notna()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool

In [30]:
df2.fillna('Missing')

Unnamed: 0,one,two,three
a,-1.646659,0.017659,1.040844
b,Missing,Missing,Missing
c,0.2395,0.022055,-1.808276
d,Missing,Missing,Missing
e,-1.998557,-0.552188,0.241354
f,0.533755,-1.141296,-2.382672
g,Missing,Missing,Missing
h,-0.044448,-0.575298,0.934528
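
Filling with a string placeholder such as `'Missing'` mixes text into numeric columns, so they can no longer be treated as plain floats. A common alternative is to fill each column with one of its own statistics. This is a minimal sketch on a made-up frame, not the random data above:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, 4.0, 8.0]})

# Fill each column's gaps with that column's mean; the columns stay numeric
filled = gaps.fillna(gaps.mean())

print(filled)
```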


In [31]:
df2['one'].values

array([-1.64665932,         nan,  0.23950044,         nan, -1.99855683,
        0.53375535,         nan, -0.04444822])

In [32]:
### Reading different data sources with the help of pandas

### Read files

## CSV

In [33]:
from io import StringIO, BytesIO

In [34]:
data = ('col1,col2,col3\n'
            'x,y,1\n'
            'a,b,2\n'
            'c,d,3')

In [35]:
type(data)

str

In [36]:
pd.read_csv(StringIO(data))

Unnamed: 0,col1,col2,col3
0,x,y,1
1,a,b,2
2,c,d,3


In [37]:
## Read from specific columns
df=pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])
df

Unnamed: 0,col1,col3
0,x,1
1,a,2
2,c,3


In [38]:
df.to_csv('Test.csv')

In [39]:
## Specifying columns data types

data = ('a,b,c,d\n'
            '1,2,3,4\n'
            '5,6,7,8\n'
            '9,10,11')


In [40]:
print(data)

a,b,c,d
1,2,3,4
5,6,7,8
9,10,11


In [41]:
df=pd.read_csv(StringIO(data),dtype=object)

In [42]:
df

Unnamed: 0,a,b,c,d
0,1,2,3,4.0
1,5,6,7,8.0
2,9,10,11,


In [43]:
df['a']

0    1
1    5
2    9
Name: a, dtype: object

In [44]:
df=pd.read_csv(StringIO(data),dtype={'b':int,'c':float,'a':'Int64'})

In [45]:
df

Unnamed: 0,a,b,c,d
0,1,2,3.0,4.0
1,5,6,7.0,8.0
2,9,10,11.0,


In [46]:
df['a'][1]

5

In [47]:
## check the datatype
df.dtypes

a      Int64
b      int32
c    float64
d    float64
dtype: object

In [48]:
## Index columns and trailing delimiters


In [49]:
data = ('index,a,b,c\n'
           '4,apple,bat,5.7\n'
            '8,orange,cow,10')

In [50]:
pd.read_csv(StringIO(data),index_col=0)

index,a,b,c
4,apple,bat,5.7
8,orange,cow,10.0


In [51]:
data = ('a,b,c\n'
           '4,apple,bat,\n'
            '8,orange,cow,')

In [52]:
pd.read_csv(StringIO(data))

Unnamed: 0,a,b,c
4,apple,bat,
8,orange,cow,


In [53]:
pd.read_csv(StringIO(data),index_col=False)

Unnamed: 0,a,b,c
0,4,apple,bat
1,8,orange,cow
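
The two results above differ because every data row ends with a trailing comma, so each row has one more field than the header. In that situation pandas treats the first field as the index by default, and `index_col=False` tells it not to. A short sketch of the comparison, with `raw` as an illustrative variable name:

```python
from io import StringIO

import pandas as pd

raw = "a,b,c\n4,apple,bat,\n8,orange,cow,"

default = pd.read_csv(StringIO(raw))                    # first field becomes the index
no_index = pd.read_csv(StringIO(raw), index_col=False)  # trailing empty field is discarded

print(default.index.tolist())
print(no_index.columns.tolist())
```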


In [54]:
## Combining usecols and index_col
data = ('a,b,c\n'
           '4,apple,bat,\n'
            '8,orange,cow,')

In [55]:
pd.read_csv(StringIO(data), usecols=['b', 'c'],index_col=False)

Unnamed: 0,b,c
0,apple,bat
1,orange,cow


In [56]:
## Quoting and Escape Characters. Very useful in NLP

data = 'a,b\n"hello, \\"Bob\\", nice to see you",5'

In [57]:
pd.read_csv(StringIO(data),escapechar='\\')

Unnamed: 0,a,b
0,"hello, ""Bob"", nice to see you",5


In [58]:
## URL to CSV

df=pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item',sep='\t')

HTTPError: HTTP Error 403: Forbidden

In [None]:
df.head()

In [None]:
## Read Json to CSV

In [None]:
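Data = '{"employee_name": "James", "email": "james@gmail.com", "job_profile": [{"title1":"Team Lead", "title2":"Sr. Developer"}]}'
# Wrap the literal JSON string in StringIO; newer pandas versions deprecate passing raw strings
df1=pd.read_json(StringIO(Data))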

In [None]:
df1

In [None]:
df1.to_json()

In [None]:
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

In [None]:
df = df.head()

In [None]:
# write the DataFrame to a CSV file

In [None]:
df.to_csv('wine.csv')

In [None]:
# convert the DataFrame to different JSON formats

df.to_json(orient="values")

In [None]:

df.to_json(orient="records")
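
To see what the `orient` argument changes, here is a small sketch on a made-up two-row frame. Each result is parsed back with `json.loads` so the structure is easy to inspect.

```python
import json

import pandas as pd

small = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

as_values = json.loads(small.to_json(orient="values"))    # list of rows, no labels
as_records = json.loads(small.to_json(orient="records"))  # one dict per row
as_columns = json.loads(small.to_json())                  # default: column -> {index -> value}

print(as_values)
print(as_records)
print(as_columns)
```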

## Reading HTML content 

In [59]:
html_table = """ 
<table> 
  <thead> 
    <tr> 
      <th>ID</th> 
      <th>Name</th> 
      <th>Branch</th> 
      <th>Result</th> 
    </tr> 
  </thead> 
  <tbody> 
    <tr> 
      <td>5</td> 
      <td>Patrick</td> 
      <td>Civil</td> 
      <td>Pass</td> 
    </tr> 
    <tr> 
      <td>1</td> 
      <td>Maverick</td> 
      <td>Mechanical</td> 
      <td>Fail</td> 
    </tr> 
    <tr> 
      <td>4</td> 
      <td>Peter</td> 
      <td>Computer Science</td> 
      <td>Pass</td> 
    </tr> 
    <tr> 
      <td>8</td> 
      <td>Parker</td> 
      <td>Chemical</td> 
      <td>Fail</td> 
    </tr> 
  </tbody> 
</table> 
"""  

# Wrap the literal HTML in StringIO; newer pandas versions deprecate passing raw strings
dfs = pd.read_html(StringIO(html_table))

In [60]:
dfs[0]

Unnamed: 0,ID,Name,Branch,Result
0,5,Patrick,Civil,Pass
1,1,Maverick,Mechanical,Fail
2,4,Peter,Computer Science,Pass
3,8,Parker,Chemical,Fail


In [61]:
url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Country', header=0)

In [62]:
dfs[0]

Unnamed: 0,Mobile country code,Country,ISO 3166,Mobile network codes,National MNC authority,Remarks
0,289,A Abkhazia,GE-AB,List of mobile network codes in Abkhazia,,MCC is not listed by ITU
1,412,Afghanistan,AF,List of mobile network codes in Afghanistan,,
2,276,Albania,AL,List of mobile network codes in Albania,,
3,603,Algeria,DZ,List of mobile network codes in Algeria,,
4,544,American Samoa (United States of America),AS,List of mobile network codes in American Samoa,,
...,...,...,...,...,...,...
247,452,Vietnam,VN,List of mobile network codes in the Vietnam,,
248,543,W Wallis and Futuna,WF,List of mobile network codes in Wallis and Futuna,,
249,421,Y Yemen,YE,List of mobile network codes in the Yemen,,
250,645,Z Zambia,ZM,List of mobile network codes in Zambia,,


## Reading Excel Files

In [63]:
df_excel=pd.read_excel('sheet.xlsx')

Sometimes this code raises `XLRDError: Excel xlsx file; not supported` for xlsx files. In that case, use the `openpyxl` engine to read the Excel file instead.

In [64]:
df_excel = pd.read_excel(r'sheet.xlsx', engine='openpyxl')

In [65]:
df_excel.head()

Unnamed: 0.1,Unnamed: 0,PART,PART_DESC,PROGRAM,VENDOR_ID,VENDOR_NAME,SITE,P_CODE,GSM_COMMODITY,COMMODITY_SUBGROUP,...,SITE_FLAG,PROGRAM_FLAG,P_CODE_FLAG,BUY_SELL_FLAG,PUBLISH_CHK_FLAG,DW_CREATE_TS,DW_CREATE_USER,DW_UPDATE_TS,DW_UPDATE_USER,SUPPLIER_ALLOCATION
0,0.0,452-04418,"SCREW,M1.6X0.35,FEED,X1920",J160,000049M,AMPHENOL,ALL_SITES,,Enclosure - Mac,Uncategorized,...,0.0,0.0,2.0,0.0,N,2021-05-22,1.0,2021-12-14 04:02:53,1.0,0.0
1,1.0,452-04418,"SCREW,M1.6X0.35,FEED,X1920",J170,000049M,AMPHENOL,ALL_SITES,,Enclosure - Mac,Uncategorized,...,0.0,0.0,2.0,0.0,N,2021-05-22,1.0,2021-12-14 04:02:53,1.0,0.0
2,2.0,514-00091,"CONN,RCPT,RJ45,X667",J137,000049M,AMPHENOL,QSMC_GRP,,Connector,Connector,...,0.0,2.0,2.0,0.0,N,2020-12-06,1.0,2022-02-28 23:03:01,2700846000.0,80.0
3,3.0,514-00167,"CONN,RCPT,MPM,X1982",J314,000049M,AMPHENOL,ALL_SITES,,Connector,Connector,...,0.0,1.0,2.0,0.0,N,2021-03-27,1.0,2021-11-11 21:53:52,2700846000.0,45.0
4,4.0,514-00167,"CONN,RCPT,MPM,X1982",J316,000049M,AMPHENOL,ALL_SITES,,Connector,Connector,...,0.0,1.0,2.0,0.0,N,2021-03-27,1.0,2021-11-11 21:53:52,2700846000.0,45.0


## Pickling
All pandas objects are equipped with `to_pickle` methods, which use Python's `pickle` module to save data structures to disk in the pickle format.

In [66]:
df_excel.to_pickle('df_excel')

In [67]:
df=pd.read_pickle('df_excel')

In [68]:
df.head()

Unnamed: 0.1,Unnamed: 0,PART,PART_DESC,PROGRAM,VENDOR_ID,VENDOR_NAME,SITE,P_CODE,GSM_COMMODITY,COMMODITY_SUBGROUP,...,SITE_FLAG,PROGRAM_FLAG,P_CODE_FLAG,BUY_SELL_FLAG,PUBLISH_CHK_FLAG,DW_CREATE_TS,DW_CREATE_USER,DW_UPDATE_TS,DW_UPDATE_USER,SUPPLIER_ALLOCATION
0,0.0,452-04418,"SCREW,M1.6X0.35,FEED,X1920",J160,000049M,AMPHENOL,ALL_SITES,,Enclosure - Mac,Uncategorized,...,0.0,0.0,2.0,0.0,N,2021-05-22,1.0,2021-12-14 04:02:53,1.0,0.0
1,1.0,452-04418,"SCREW,M1.6X0.35,FEED,X1920",J170,000049M,AMPHENOL,ALL_SITES,,Enclosure - Mac,Uncategorized,...,0.0,0.0,2.0,0.0,N,2021-05-22,1.0,2021-12-14 04:02:53,1.0,0.0
2,2.0,514-00091,"CONN,RCPT,RJ45,X667",J137,000049M,AMPHENOL,QSMC_GRP,,Connector,Connector,...,0.0,2.0,2.0,0.0,N,2020-12-06,1.0,2022-02-28 23:03:01,2700846000.0,80.0
3,3.0,514-00167,"CONN,RCPT,MPM,X1982",J314,000049M,AMPHENOL,ALL_SITES,,Connector,Connector,...,0.0,1.0,2.0,0.0,N,2021-03-27,1.0,2021-11-11 21:53:52,2700846000.0,45.0
4,4.0,514-00167,"CONN,RCPT,MPM,X1982",J316,000049M,AMPHENOL,ALL_SITES,,Connector,Connector,...,0.0,1.0,2.0,0.0,N,2021-03-27,1.0,2021-11-11 21:53:52,2700846000.0,45.0
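
One reason to pickle instead of writing CSV is that the round trip preserves dtypes and the index exactly. The sketch below checks this on a tiny made-up frame written to a temporary directory. The file name is hypothetical. As with any pickle, only load files you trust.

```python
import os
import tempfile

import pandas as pd

original = pd.DataFrame({"part": ["452-04418", "514-00091"], "qty": [10, 25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")  # hypothetical file name
    original.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(original))
```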


## Setup

In [69]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

pd.options.display.max_rows = None
warnings.filterwarnings("ignore")

## Introduction

"I wish I could do this operation in Pandas…."

Well, chances are, you can!

Pandas is so vast and deep that it enables you to execute virtually any tabular manipulation you can think of. However, this vastness sometimes comes at a disadvantage.

Many elegant features that solve rare edge cases and unique scenarios are lost in the documentation, overshadowed by the more frequently used functions.

This kernel aims to rediscover those features and show you that Pandas is more capable than you ever knew.

## 1. `ExcelWriter`

`ExcelWriter` is a generic class for creating Excel files (with sheets!) and writing DataFrames to them. Let's say we have these two datasets:

In [70]:
# Load two datasets
diamonds = sns.load_dataset("diamonds")
tips = sns.load_dataset("tips")

```python
# Write to the same excel file
with pd.ExcelWriter("data.xlsx") as writer:

    diamonds.to_excel(writer, sheet_name="diamonds")
    tips.to_excel(writer, sheet_name="tips")
```

It has additional parameters to specify the DateTime format to be used, whether you want to create a new Excel file or modify an existing one, what happens when a sheet already exists, etc. Check out the details in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.ExcelWriter.html).

## 2. `pipe`

`pipe` is one of the best functions for doing data cleaning in a concise, compact manner in Pandas. It allows you to chain multiple custom functions into a single operation.

For example, let's say you have functions to `drop_duplicates`, `remove_outliers`, `encode_categoricals` that accept their own arguments. Here is how you apply all three in a single operation:

```python
df_preprocessed = (diamonds.pipe(drop_duplicates).
                            pipe(remove_outliers, ['price', 'carat', 'depth']).
                            pipe(encode_categoricals, ['cut', 'color', 'clarity'])
                  )
```

I like how this function resembles [Sklearn pipelines](https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d). There is more you can do with it, so check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html) or this [helpful article](https://towardsdatascience.com/a-better-way-for-data-preprocessing-pandas-pipe-a08336a012bc).
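
The three helper functions in the snippet above are not defined in this notebook. The following is a minimal sketch of what they might look like, so the chain can actually run. The function bodies, the `z` threshold and the tiny `raw` frame are all assumptions made for illustration. The only real requirement of `pipe` is that each function takes a DataFrame as its first argument and returns a DataFrame.

```python
import pandas as pd

# Hypothetical cleaning helpers: DataFrame in, DataFrame out
def drop_duplicates(df):
    return df.drop_duplicates()

def remove_outliers(df, columns, z=3.0):
    # Keep rows within z standard deviations of the mean in every listed column
    keep = pd.Series(True, index=df.index)
    for col in columns:
        scores = (df[col] - df[col].mean()) / df[col].std()
        keep &= scores.abs() <= z
    return df[keep]

def encode_categoricals(df, columns):
    out = df.copy()
    for col in columns:
        out[col] = pd.factorize(out[col])[0]
    return out

raw = pd.DataFrame({"price": [10, 10, 12, 11, 500], "cut": ["a", "a", "b", "a", "b"]})
clean = (
    raw.pipe(drop_duplicates)
    .pipe(remove_outliers, ["price"], z=1.0)
    .pipe(encode_categoricals, ["cut"])
)

print(clean)
```

Extra positional and keyword arguments passed to `pipe` are forwarded to the function, which is how `["price"]` and `z=1.0` reach `remove_outliers`.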

## 3. `factorize`

This function is a pandas alternative to Sklearn's `LabelEncoder`:

In [71]:
# Mind the [0] at the end
diamonds["cut_enc"] = pd.factorize(diamonds["cut"])[0]

diamonds["cut_enc"].sample(5)

24504    1
30680    0
52931    0
27048    1
34575    0
Name: cut_enc, dtype: int64

Unlike `LabelEncoder`, `factorize` returns a tuple of two values: the encoded column and a list of the unique categories:

In [72]:
codes, unique = pd.factorize(diamonds["cut"], sort=True)

codes[:10]

array([2, 3, 1, 3, 1, 4, 4, 4, 0, 4], dtype=int64)

In [73]:
unique

Index(['Fair', 'Good', 'Ideal', 'Premium', 'Very Good'], dtype='object')
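
Because the codes are positions into the unique values, the original labels can be recovered from the pair. A small sketch on a made-up Series:

```python
import pandas as pd

cuts = pd.Series(["Ideal", "Good", "Ideal", "Fair"])
codes, uniques = pd.factorize(cuts, sort=True)

# uniques is sorted, so the codes index into ['Fair', 'Good', 'Ideal']
decoded = uniques.take(codes)

print(codes)
print(list(decoded))
```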

## 4. `explode` - 🤯🤯

A function with an interesting name is `explode`. Let's see an example first and then explain it:

In [74]:
data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")
data

Unnamed: 0,dirty
0,1
1,6
2,7
3,"[46, 56, 49]"
4,45
5,"[15, 10, 12]"


The `dirty` column has two rows where values are recorded as actual lists. You may often see this type of data in surveys as some questions accept multiple answers.

In [75]:
data.explode("dirty", ignore_index=True)

Unnamed: 0,dirty
0,1
1,6
2,7
3,46
4,56
5,49
6,45
7,15
8,10
9,12


`explode` takes a cell with an array-like value and explodes it into multiple rows. By default, each new row repeats the index label of the row it came from. Set `ignore_index=True` to get a fresh 0 to n-1 index instead.
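
The effect of `ignore_index` is easiest to see side by side. This sketch uses a made-up survey frame with hypothetical `user` and `picked` columns:

```python
import pandas as pd

answers = pd.DataFrame({"user": ["u1", "u2"], "picked": [["a", "b"], "c"]})

kept = answers.explode("picked")                      # index labels: 0, 0, 1
reset = answers.explode("picked", ignore_index=True)  # index labels: 0, 1, 2

print(kept)
print(reset)
```

Other columns, such as `user` here, are repeated for every exploded row.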

## 5. `squeeze`

Another function with a funky name is `squeeze`, which is used in very rare but annoying edge cases.

One of these cases is when a single value is returned from a condition used to subset a DataFrame. Consider this example:

In [76]:
subset = diamonds.loc[diamonds.index < 1, ["price"]]
subset

Unnamed: 0,price
0,326


Even though there is just one cell, it is returned as a DataFrame. This can be annoying since you now have to use `.loc` again with both the column name and index to access the price.

But, if you know `squeeze`, you don't have to. The function enables you to remove an axis from a single-cell DataFrame or Series. For example:

In [77]:
subset.squeeze()

326

Now, only the scalar is returned. It is also possible to specify the axis to remove:

In [78]:
subset.squeeze("columns")  # or "rows"

0    326
Name: price, dtype: int64

Note that `squeeze` only has an effect on axes of length 1. A DataFrame with several rows and several columns is returned unchanged.
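
Here is a small sketch of how `squeeze` behaves for different shapes. The three frames are made up for illustration. Axes of length 1 are dropped, and anything else is left alone.

```python
import pandas as pd

one_cell = pd.DataFrame({"price": [326]})
one_col = pd.DataFrame({"price": [326, 327]})
table = pd.DataFrame({"price": [326, 327], "carat": [0.23, 0.21]})

scalar = one_cell.squeeze()  # both axes have length 1 -> scalar
column = one_col.squeeze()   # a single column -> Series
same = table.squeeze()       # nothing to squeeze -> still a DataFrame

print(type(scalar), type(column), type(same))
```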

## 6. `between`

A rather nifty function for boolean indexing numeric features within a range:

In [79]:
# Get diamonds that are priced between 3500 and 3700 dollars
diamonds[diamonds["price"].between(3500, 3700, inclusive="neither")].sample(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,cut_enc
4519,1.0,Premium,G,SI1,60.1,61.0,3634,6.44,6.4,3.86,1
4544,1.0,Premium,J,VS1,59.7,62.0,3640,6.52,6.46,3.87,1
4136,0.95,Very Good,I,SI1,61.0,61.0,3544,6.29,6.37,3.86,3
4301,0.7,Premium,G,IF,62.2,58.0,3591,5.63,5.69,3.52,1
4509,0.74,Ideal,E,VVS2,61.9,57.0,3633,5.79,5.81,3.59,0
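
The `inclusive` argument controls whether the boundaries themselves count as matches. A quick sketch on three made-up prices, assuming a pandas version where `inclusive` accepts the string values used above:

```python
import pandas as pd

prices = pd.Series([3500, 3600, 3700])

both = prices.between(3500, 3700)                          # default: both ends included
neither = prices.between(3500, 3700, inclusive="neither")  # both ends excluded
left = prices.between(3500, 3700, inclusive="left")        # only the lower end included

print(both.tolist(), neither.tolist(), left.tolist())
```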


## 7. `T`

All DataFrames have a simple `T` attribute, which stands for transpose. You may not use it often, but I find it quite useful when displaying DataFrames of the `describe` method:

In [80]:
## HIDE
# load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

bunch = load_boston()
boston = pd.DataFrame(bunch["data"], columns=bunch["feature_names"])

In [81]:
boston.describe().T.head(10)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CRIM,506.0,3.613524,8.601545,0.00632,0.082045,0.25651,3.677083,88.9762
ZN,506.0,11.363636,23.322453,0.0,0.0,0.0,12.5,100.0
INDUS,506.0,11.136779,6.860353,0.46,5.19,9.69,18.1,27.74
CHAS,506.0,0.06917,0.253994,0.0,0.0,0.0,0.0,1.0
NOX,506.0,0.554695,0.115878,0.385,0.449,0.538,0.624,0.871
RM,506.0,6.284634,0.702617,3.561,5.8855,6.2085,6.6235,8.78
AGE,506.0,68.574901,28.148861,2.9,45.025,77.5,94.075,100.0
DIS,506.0,3.795043,2.10571,1.1296,2.100175,3.20745,5.188425,12.1265
RAD,506.0,9.549407,8.707259,1.0,4.0,5.0,24.0,24.0
TAX,506.0,408.237154,168.537116,187.0,279.0,330.0,666.0,711.0


The Boston housing dataset has 13 numeric feature columns. If you call `describe` as-is, the DataFrame will stretch horizontally, making it hard to compare the statistics. Taking the transpose switches the axes so that the summary statistics are given in columns.

## 8. Pandas Styler

Did you know that Pandas allows you to style DataFrames?

DataFrames have a `style` attribute which opens the door to customizations and styles limited only by your HTML and CSS knowledge. I won't discuss the full details of what you can do with `style` but only show you my favorite functions:

In [85]:
## HIDE
diabetes = pd.read_csv("https://raw.githubusercontent.com/BexTuychiev/medium_stories/master/2021/8_august/1_pandas_funcs/data/diabetes.csv")

In [86]:
diabetes.describe().T.drop("count", axis=1).style.highlight_max(color="darkred")

Unnamed: 0,mean,std,min,25%,50%,75%,max
Pregnancies,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
Glucose,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
BloodPressure,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
SkinThickness,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
Insulin,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
BMI,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
DiabetesPedigreeFunction,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
Age,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
Outcome,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


Above, we are highlighting cells that hold the maximum value of a column. Another cool styler is `background_gradient` which can give columns a gradient background color based on their values:

In [87]:
diabetes.describe().T.drop("count", axis=1).style.background_gradient(
    subset=["mean", "50%"], cmap="Reds"
)

Unnamed: 0,mean,std,min,25%,50%,75%,max
Pregnancies,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
Glucose,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
BloodPressure,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
SkinThickness,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
Insulin,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
BMI,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
DiabetesPedigreeFunction,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
Age,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
Outcome,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0


This feature comes in especially handy when you are using `describe` on a table with many columns and want to compare summary statistics. Check out the documentation of the styler [here](https://pandas.pydata.org/docs/reference/style.html).

## 9. Pandas options

Like Matplotlib, pandas has global settings that you can tweak to change the default behaviors:

In [88]:
dir(pd.options)

['compute', 'display', 'io', 'mode', 'plotting', 'styler']

These settings are divided into 6 modules. Let's see what settings are there under `display`:

In [89]:
dir(pd.options.display)

['chop_threshold',
 'colheader_justify',
 'column_space',
 'date_dayfirst',
 'date_yearfirst',
 'encoding',
 'expand_frame_repr',
 'float_format',
 'html',
 'large_repr',
 'latex',
 'max_categories',
 'max_columns',
 'max_colwidth',
 'max_info_columns',
 'max_info_rows',
 'max_rows',
 'max_seq_items',
 'memory_usage',
 'min_rows',
 'multi_sparse',
 'notebook_repr_html',
 'pprint_nest_depth',
 'precision',
 'show_dimensions',
 'unicode',
 'width']

There are many options under `display`, but I mostly use `max_columns` and `precision`:

In [90]:
# Remove the limit to display the number of cols
pd.options.display.max_columns = None

# Only show 5 digits after the decimal point
pd.options.display.precision = 5

You can check out the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html) to dig deeper into this wonderful feature.
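
Assigning to `pd.options` changes the setting for the rest of the session. When a change is only needed for one block of code, `pd.option_context` restores the previous value automatically. A minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.precision")

# The option is changed only inside the with-block
with pd.option_context("display.precision", 2):
    inside = pd.get_option("display.precision")

after = pd.get_option("display.precision")

print(before, inside, after)
```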

## 10. `convert_dtypes`

We all know that pandas has an annoying tendency to mark some columns as `object` data type. Instead of manually specifying their types, you can use the `convert_dtypes` method, which tries to infer the best data type:

In [91]:
sample = pd.read_csv(
    "https://raw.githubusercontent.com/BexTuychiev/medium_stories/master/2021/8_august/1_pandas_funcs/data/station_day.csv",
    usecols=["StationId", "CO", "O3", "AQI_Bucket"],
)
sample.dtypes

StationId      object
CO            float64
O3            float64
AQI_Bucket     object
dtype: object

In [92]:
sample.convert_dtypes().dtypes

StationId      string
CO            Float64
O3            Float64
AQI_Bucket     string
dtype: object

Unfortunately, it can't parse dates due to the caveats of different date-time formats.
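
Since dates are left as strings, they need an explicit `pd.to_datetime` call afterwards. The sketch below uses a made-up frame with hypothetical column names to show both steps:

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "count": [1, 2, None],
        "ratio": [1.5, None, 2.5],
        "day": ["2021-08-01", "2021-08-02", None],
    }
)
converted = raw.convert_dtypes()

# Dates are still strings after convert_dtypes, so parse them explicitly
converted["day"] = pd.to_datetime(converted["day"])

print(converted.dtypes)
```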

## 11. `select_dtypes`

A function I use all the time is `select_dtypes`. I think it is obvious what the function does from its name. It has `include` and `exclude` parameters that you can use to select columns including or excluding certain data types.

For example, choose only numeric columns with `np.number`:

In [93]:
# Choose only numerical columns
diamonds.select_dtypes(include=np.number).head()

Unnamed: 0,carat,depth,table,price,x,y,z,cut_enc
0,0.23,61.5,55.0,326,3.95,3.98,2.43,0
1,0.21,59.8,61.0,326,3.89,3.84,2.31,1
2,0.23,56.9,65.0,327,4.05,4.07,2.31,2
3,0.29,62.4,58.0,334,4.2,4.23,2.63,1
4,0.31,63.3,58.0,335,4.34,4.35,2.75,2


Or `exclude` them:

In [94]:
# Exclude numerical columns
diamonds.select_dtypes(exclude=np.number).head()

Unnamed: 0,cut,color,clarity
0,Ideal,E,SI2
1,Premium,E,SI1
2,Good,E,VS1
3,Premium,I,VS2
4,Good,J,SI2
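
`include` and `exclude` also accept dtype names as strings. A brief sketch on a made-up one-row frame, which also shows that boolean columns are not counted as numeric:

```python
import pandas as pd

mixed = pd.DataFrame(
    {"carat": [0.23], "price": [326], "cut": ["Ideal"], "certified": [True]}
)

numeric = mixed.select_dtypes(include="number").columns.tolist()
flags = mixed.select_dtypes(include="bool").columns.tolist()
non_numeric = mixed.select_dtypes(exclude="number").columns.tolist()

print(numeric, flags, non_numeric)
```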


## 12. `mask`

`mask` allows you to quickly replace cell values where a custom condition is true. 

For example, let's say we have survey data collected from people aged 50-60.

In [95]:
# Create sample data
ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60]).to_frame("ages")

ages

Unnamed: 0,ages
0,55
1,52
2,50
3,66
4,57
5,59
6,49
7,60


We will treat ages that are outside 50-60 range (there are two, 49 and 66) as data entry mistakes and replace them with NaNs.

In [96]:
ages.mask(cond=~ages["ages"].between(50, 60), other=np.nan)

Unnamed: 0,ages
0,55.0
1,52.0
2,50.0
3,
4,57.0
5,59.0
6,
7,60.0


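So, `mask` replaces the values where `cond` is `True` with `other`. Here the condition is negated with `~`, so the ages outside the 50-60 range are the ones replaced.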
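
`mask` has a mirror-image counterpart, `where`, and it is easy to confuse the two. The following sketch on a shortened, made-up ages Series shows that they give the same result once the condition is inverted:

```python
import pandas as pd

ages = pd.Series([55, 66, 49, 60])
in_range = ages.between(50, 60)

masked = ages.mask(~in_range)  # replace where the condition is True
kept = ages.where(in_range)    # keep where the condition is True, replace the rest

print(masked.tolist())
```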

## 13. `min` and `max` along the columns axis

Even though `min` and `max` functions are well-known, they have another useful property for some edge-cases. Consider this dataset:

In [97]:
index = ["Diamonds", "Titanic", "Iris", "Heart Disease", "Loan Default"]
libraries = ["XGBoost", "CatBoost", "LightGBM", "Sklearn GB"]
df = pd.DataFrame(
    {lib: np.random.uniform(90, 100, 5) for lib in libraries}, index=index
)

df

Unnamed: 0,XGBoost,CatBoost,LightGBM,Sklearn GB
Diamonds,93.3876,97.50072,90.52697,91.82633
Titanic,91.61701,92.58733,91.46973,93.93463
Iris,93.99501,91.87093,92.58139,95.18023
Heart Disease,97.70143,97.53937,95.23871,99.86414
Loan Default,93.93219,97.66439,92.55042,92.08887


The fake DataFrame above holds the scores of 4 different gradient boosting libraries on 5 datasets. We want to find the best score achieved on each dataset. Here is how you do it elegantly with `max`:

In [98]:
df.max(axis=1)

Diamonds         97.50072
Titanic          93.93463
Iris             95.18023
Heart Disease    99.86414
Loan Default     97.66439
dtype: float64

Just change the axis to 1 and you get a row-wise max/min. 

## 14. `nlargest` and `nsmallest`

Sometimes you don't just want the min/max of a column. You want to see the top N or bottom N values of a variable. This is where `nlargest` and `nsmallest` come in handy.

Let's see the top 5 most expensive and cheapest diamonds:

In [99]:
diamonds.nlargest(5, "price")

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,cut_enc
27749,2.29,Premium,I,VS2,60.8,60.0,18823,8.5,8.47,5.16,1
27748,2.0,Very Good,G,SI1,63.5,56.0,18818,7.9,7.97,5.04,3
27747,1.51,Ideal,G,IF,61.7,55.0,18806,7.37,7.41,4.56,0
27746,2.07,Ideal,G,SI2,62.5,55.0,18804,8.2,8.13,5.11,0
27745,2.0,Very Good,H,SI1,62.8,57.0,18803,7.95,8.0,5.01,3


In [100]:
diamonds.nsmallest(5, "price")

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,cut_enc
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,0
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31,1
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31,2
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63,1
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75,2


## 15. `idxmax` and `idxmin`

When you call `max` or `min` on a column, pandas returns the value that is largest/smallest. However, sometimes you want the index label at which the min/max occurs, which these functions do not return.

Instead, you should use `idxmax`/`idxmin`:

In [101]:
diamonds.price.idxmax()

27749

In [102]:
diamonds.carat.idxmin()

14

On a DataFrame, you can also specify the `columns` axis, in which case the functions return, for each row, the label of the column that holds the max/min.
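
Going back to the library comparison from section 13, `idxmax` along the columns axis answers the question of which library scored best on each dataset. This sketch uses a small made-up score table rather than the random one above:

```python
import pandas as pd

scores = pd.DataFrame(
    {"XGBoost": [93.4, 91.6], "CatBoost": [97.5, 92.6], "LightGBM": [90.5, 95.1]},
    index=["Diamonds", "Titanic"],
)

best_score = scores.max(axis=1)       # highest value in each row
best_library = scores.idxmax(axis=1)  # column label holding that value
best_dataset = scores.idxmax()        # per column: row label of the max

print(best_library)
```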

## 16. `value_counts` with `dropna=False`

A common operation to find the percentage of missing values in a column is to chain `isnull` and `sum` and divide by the length of the array. 

But, you can do the same thing with `value_counts` with relevant arguments:

In [123]:
ames_housing = pd.read_csv("gender_submission.csv")

ames_housing["Survived"].value_counts(dropna=False, normalize=True)

0    0.63636
1    0.36364
Name: Survived, dtype: float64

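In this file, `Survived` has no missing values, so only the two classes and their shares appear. If the column contained nulls, `NaN` would be listed as its own row together with its share of the total.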
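
To show the two approaches agreeing on a column that does contain nulls, here is a sketch on a made-up Series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

quality = pd.Series(["Gd", np.nan, "TA", np.nan, "Gd"])  # made-up values

shares = quality.value_counts(dropna=False, normalize=True)
missing_share = quality.isnull().sum() / len(quality)

# The NaN row of value_counts carries the same number as the manual calculation
nan_share = shares[shares.index.isna()].iloc[0]

print(shares)
print(missing_share)
```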

## 17. `clip`

Outlier detection and removal is common in data analysis. 

The `clip` function makes it really easy to find values outside a range and replace them with the hard limits.

Let's go back to the ages example:

In [124]:
ages

Unnamed: 0,ages
0,55
1,52
2,50
3,66
4,57
5,59
6,49
7,60


This time, we will replace the out-of-range ages with the hard limits of 50 and 60:

In [125]:
ages.clip(50, 60)

Unnamed: 0,ages
0,55
1,52
2,50
3,60
4,57
5,59
6,50
7,60


Fast and efficient!
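
`clip` can also be applied on one side only, and comparing against the clipped result is a simple way to flag which rows were changed. A short sketch using the same ages as a Series:

```python
import pandas as pd

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60])

floor_only = ages.clip(lower=50)         # only raise values below 50
cap_only = ages.clip(upper=60)           # only lower values above 60
was_clipped = ages != ages.clip(50, 60)  # True for rows that were changed

print(ages[was_clipped].tolist())
```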

## 18. `at_time` and `between_time`

These two can be useful when working with time-series that have high granularity. 

`at_time` allows you to subset values at a specific time of day. Consider this time series:

In [126]:
index = pd.date_range("2021-08-01", periods=100, freq="H")
data = pd.DataFrame({"col": list(range(100))}, index=index)

data.head()

Unnamed: 0,col
2021-08-01 00:00:00,0
2021-08-01 01:00:00,1
2021-08-01 02:00:00,2
2021-08-01 03:00:00,3
2021-08-01 04:00:00,4


Let's select all rows at 3 PM:

In [127]:
data.at_time("15:00")

Unnamed: 0,col
2021-08-01 15:00:00,15
2021-08-02 15:00:00,39
2021-08-03 15:00:00,63
2021-08-04 15:00:00,87


Cool, huh? Now, let's use `between_time` to select rows within a custom interval:

In [128]:
from datetime import datetime

data.between_time("09:45", "12:00")

Unnamed: 0,col
2021-08-01 10:00:00,10
2021-08-01 11:00:00,11
2021-08-01 12:00:00,12
2021-08-02 10:00:00,34
2021-08-02 11:00:00,35
2021-08-02 12:00:00,36
2021-08-03 10:00:00,58
2021-08-03 11:00:00,59
2021-08-03 12:00:00,60
2021-08-04 10:00:00,82


Note that both functions require a `DatetimeIndex` and only work with times of day (as in *o'clock*). If you want to subset within a date-time interval, use `between` on a datetime column or slice the index with `.loc`.
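
For selecting a date interval rather than a time of day, here is a sketch of two options on a made-up daily series. `Series.between` works on values, not on the index, so the index is converted with `to_series` first. Label slicing with `.loc` is usually the shorter route.

```python
import pandas as pd

days = pd.date_range("2021-08-01", periods=10, freq="D")
daily = pd.DataFrame({"col": range(10)}, index=days)

# Label slicing on a DatetimeIndex includes both endpoints
by_slice = daily.loc["2021-08-03":"2021-08-05"]

# between needs the timestamps as values, so convert the index to a Series
in_window = days.to_series().between(
    pd.Timestamp("2021-08-03"), pd.Timestamp("2021-08-05")
)
by_between = daily[in_window]

print(by_slice)
```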

## 19. `bdate_range`

`bdate_range` is a short-hand function to create a `DatetimeIndex` with business-day frequency:

In [129]:
series = pd.bdate_range("2021-01-01", "2021-01-31")  # A period of one month
len(series)

21

Business-day frequencies are common in the financial world. So, this function may come in handy when reindexing existing time series with the `reindex` function.
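
As a sketch of that reindexing use case, the following takes a made-up calendar-day series and keeps only the business days:

```python
import pandas as pd

calendar_days = pd.date_range("2021-01-01", "2021-01-10")  # includes weekends
daily = pd.Series(range(len(calendar_days)), index=calendar_days)

business_days = pd.bdate_range("2021-01-01", "2021-01-10")
weekdays_only = daily.reindex(business_days)  # weekend rows are dropped

print(weekdays_only)
```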

## 20. `autocorr`

One of the critical components in time-series analysis is examining the autocorrelation of a variable. 

Autocorrelation is the plain-old correlation coefficient but it is calculated with the lagging version of a time series. 

In more detail, autocorrelation of a time series at `lag=k` is calculated as follows:

1. The time series is shifted by `k` periods (lags 1 to 4 are shown here):

In [130]:
## HIDE
# Prep the data for an example
dt = pd.date_range("2021-01-01", periods=len(tips))
tips.index = dt

# Copy so that adding lag columns later does not trigger SettingWithCopyWarning
time_series = tips[["tip"]].copy()

In [131]:
time_series["lag_1"] = time_series["tip"].shift(1)
time_series["lag_2"] = time_series["tip"].shift(2)
time_series["lag_3"] = time_series["tip"].shift(3)
time_series["lag_4"] = time_series["tip"].shift(4)
# time_series['lag_k'] = time_series['tip'].shift(k)

time_series.head()

Unnamed: 0,tip,lag_1,lag_2,lag_3,lag_4
2021-01-01,1.01,,,,
2021-01-02,1.66,1.01,,,
2021-01-03,3.5,1.66,1.01,,
2021-01-04,3.31,3.5,1.66,1.01,
2021-01-05,3.61,3.31,3.5,1.66,1.01


2. Correlation is calculated between the original `tip` and each `lag_*`. 

Instead of doing all this manually, you can use the `autocorr` function of Pandas:

In [132]:
# Autocorrelation of tip at lag 8
time_series["tip"].autocorr(lag=8)

0.07475238789967067
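
The two manual steps above and `autocorr` give the same number. A small sketch on a made-up Series of tips (the values are invented):

```python
import pandas as pd

tip = pd.Series([1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23])

manual = tip.corr(tip.shift(2))  # correlate the series with its lag-2 copy
built_in = tip.autocorr(lag=2)

print(manual, built_in)
```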

You can read more about the importance of autocorrelation in time-series analysis from this [post](https://towardsdatascience.com/advanced-time-series-analysis-in-python-decomposition-autocorrelation-115aa64f475e).

## 21. `hasnans`

Pandas offers a quick method to check if a given series contains any nulls with `hasnans` attribute:

In [133]:
series = pd.Series([2, 4, 6, "sadf", np.nan])
series.hasnans

True

According to its [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.hasnans.html), it enables various performance increases. Note that the attribute is available on `pd.Series` but not on DataFrames.
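
For a DataFrame, a similar check can be done with `isna().any()`. This sketch uses made-up data to show both:

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([2, 4, np.nan])
complete = pd.Series([2, 4, 6])
frame = pd.DataFrame({"a": with_gap, "b": complete})

# Per-column check for a DataFrame; chain a second .any() for a single answer
per_column = frame.isna().any()

print(with_gap.hasnans, complete.hasnans)
print(per_column)
```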

## 22. `at` and `iat`

These two accessors are much faster alternatives to `loc` and `iloc`, with one disadvantage: they only allow selecting or replacing a single value at a time.

In [134]:
# [index, label]
diamonds.at[234, "cut"]

'Ideal'

In [135]:
# [index, index]
diamonds.iat[1564, 4]

61.2

In [136]:
# Replace 16541th row of the price column
diamonds.at[16541, "price"] = 10000
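
The difference between the two is the same as between `loc` and `iloc`: `at` uses labels and `iat` uses integer positions. A sketch on a made-up frame with a non-default index makes this visible:

```python
import pandas as pd

stones = pd.DataFrame({"cut": ["Ideal", "Good"], "price": [326, 327]}, index=[10, 20])

by_label = stones.at[20, "cut"]  # row label 20, column label "cut"
by_position = stones.iat[1, 0]   # second row, first column
stones.at[10, "price"] = 400     # assign a single cell in place

print(by_label, by_position)
```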

## 23. `argsort`

You should use this function when you want to extract the indices that would sort an array:

In [137]:
tips.reset_index(inplace=True, drop=True)

sort_idx = tips["total_bill"].argsort(kind="mergesort")

# Now, sort `tips` based on total_bill
tips.iloc[sort_idx].head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
67,3.07,1.0,Female,Yes,Sat,Dinner,1
92,5.75,1.0,Female,Yes,Fri,Dinner,2
111,7.25,1.0,Female,No,Sat,Dinner,1
172,7.25,5.15,Male,Yes,Sun,Dinner,2
149,7.51,2.0,Male,No,Thur,Lunch,2
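
Note that `argsort` returns integer positions rather than index labels, which is why the result is used with `iloc`. A small sketch on a made-up Series with string labels:

```python
import pandas as pd

bills = pd.Series([16.99, 3.07, 10.34], index=["a", "b", "c"])

order = bills.argsort()                     # positions that would sort the values
sorted_bills = bills.iloc[order.to_numpy()]

print(order.tolist())
print(sorted_bills)
```

When only the sorted data is needed, `sort_values` gives the same result directly.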


## 24. `cat` accessor

It is common knowledge that Pandas lets you use built-in functions on dates and strings through accessors like `dt` or `str`.

Pandas also has a special `category` data type for categorical variables. In this copy of the dataset, however, `cut`, `color`, and `clarity` are stored as `object`, as can be seen below:

In [138]:
diamonds.dtypes

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
cut_enc      int64
dtype: object

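When a column is `category`, you can use several special functions through the `cat` accessor. For example, let's try to see the unique categories of diamond cuts. Because `cut` is still `object` here, the call fails until the column is converted with `astype("category")`: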

In [145]:
diamonds["cut"].cat.categories

AttributeError: Can only use .cat accessor with a 'category' dtype

There are also functions like `remove_categories` or `rename_categories`, etc.:

In [146]:
diamonds["new_cuts"] = diamonds["cut"].cat.rename_categories(list("ABCDE"))
diamonds["new_cuts"].cat.categories

AttributeError: Can only use .cat accessor with a 'category' dtype

You can see the full list of functions under the `cat` accessor [here](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#categorical-accessor).
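
The `AttributeError` outputs above occur because the column is not of `category` dtype. Converting it first makes the accessor available. A minimal sketch on a made-up Series, with a hypothetical renaming to letters:

```python
import pandas as pd

cuts = pd.Series(["Ideal", "Premium", "Good", "Ideal"])  # not yet categorical

as_category = cuts.astype("category")
categories = as_category.cat.categories.tolist()

# Hypothetical renaming of each category to a single letter
renamed = as_category.cat.rename_categories({"Ideal": "A", "Premium": "B", "Good": "C"})

print(categories)
print(renamed.tolist())
```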

## 25. `GroupBy.nth`

This function only works with `GroupBy` objects. Specifically, after grouping, `nth` returns the nth row from each group:

In [151]:
diamonds.groupby("cut").nth(5)

cut,carat,color,clarity,depth,table,price,x,y,z,cut_enc
Fair,0.91,H,SI2,64.4,57.0,2763,6.11,6.09,3.93,4
Good,0.3,I,SI2,63.3,56.0,351,4.26,4.3,2.71,2
Ideal,0.33,I,SI2,61.2,56.0,403,4.49,4.5,2.75,0
Premium,0.24,I,VS1,62.5,57.0,355,3.97,3.94,2.47,1
Very Good,0.23,E,VS2,63.8,55.0,352,3.85,3.92,2.48,3


## Summary

Even though libraries like Dask and datatable are slowly winning users over from Pandas with their shiny new features for handling massive datasets, Pandas remains the most widely used data manipulation tool in the Python data science ecosystem.

The library remains a role model for other packages to imitate and improve upon, as it integrates so well into the modern SciPy stack. Thank you for reading!

## You might also be interested...
- [My 6-part Powerful EDA Template](https://www.kaggle.com/bextuychiev/my-6-part-powerful-eda-template)
- [Lasso regression with Pipelines (Tutorial)](https://www.kaggle.com/bextuychiev/lasso-regression-with-pipelines-tutorial)
- [Awesome EDA + XGBoost CV Baseline](https://www.kaggle.com/bextuychiev/relevant-eda-xgboost-cv-baseline)