<h1 style="margin-bottom:0"><center>DI 501 - Introduction to Data Informatics</center></h1>
<h2 style="margin-top:0"><center>Pandas Tutorial</center></h2>
<br>
<p style="margin-top:0"><center><b>This tutorial is prepared for Middle East Technical University's DI 501 - Introduction to Data Informatics course.</b></center></p>
<hr style="height:2px;color:navy;margin-top:0">
<p style="margin-top:0; text-align: justify; font-size:15px">Pandas is a widely used open-source Python library, popular among scientists and engineers. It is a core library for data science and provides high-performance, easy-to-use data structures and data analysis tools for Python.</p>
<p style="margin-top:1; text-align: justify; font-size:15px">The Pandas library is huge and offers a large number of built-in functions. In this tutorial, we will cover some of the fundamental ones. You can always find further functionality of the Pandas library by searching online. </p>

<h3 style="margin-bottom:0">1) Installation</h3>
<br>
<p style="margin-top:0; text-align: justify">To install Pandas, you first need a Python environment. If you do not have Python, we strongly recommend the <a href="https://www.anaconda.com/">Anaconda</a> distribution, as it is beginner friendly. </p>
<p style="margin-top:1; text-align: justify">If you already have Python, you can install Pandas with one of the following commands: </p>

In [1]:
conda install pandas

Channels:
 - defaults
Platform: osx-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/anaconda3

  added / updated specs:
    - pandas


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openpyxl-3.1.5             |  py311h46256e1_1         691 KB
    pandas-2.2.3               |  py311h6d0c2b6_0        14.9 MB
    ------------------------------------------------------------
                                           Total:        15.6 MB

The following packages will be UPDATED:

  openpyxl                           3.0.10-py311h6c40b1e_0 --> 3.1.5-py311h46256e1_1 
  pandas                              2.1.4-py311hdb55bb0_0 --> 2.2.3-py311h6d0c2b6_0 




or

In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


<p style="margin-top:0; text-align: justify">You can also run these commands in a command prompt or the Anaconda Prompt to install Pandas. If you have any problems, you can refer to <a href="https://www.geeksforgeeks.org/how-to-install-python-pandas-on-windows-and-linux/">this website</a> or ask the course assistants directly. </p>

<h3 style="margin-bottom:0">2) Importing</h3>
<br>
<p style="margin-top:0; text-align: justify">To use the Pandas library, you first need to import it. It is common practice to abbreviate pandas as pd and NumPy as np. </p>

In [3]:
import pandas as pd
import numpy as np

<h3 style="margin-bottom:0">3) Creating DataFrames and Series</h3>
<br>
<p style="margin-top:0; text-align: justify"><b>DataFrames</b> are the core data structure of the Pandas library: a DataFrame represents a table of data with labeled rows and columns. They are commonly abbreviated as 'df'. <b>Series</b>, on the other hand, are one-dimensional data structures. If you know some linear algebra, you can think of a Series as a column vector and of a DataFrame as a matrix. </p>
<br>
<p style="margin-top:0; text-align: justify">A DataFrame can be built with the following syntax:</p>

In [4]:
df1 = pd.DataFrame({'x':[1,2,3], 'y':[4,5,6], 'z':[7,8,9]})
df1

Unnamed: 0,x,y,z
0,1,4,7
1,2,5,8
2,3,6,9


If we want to create a Series from this dataframe, we can simply assign one of its columns to a new variable.

In [5]:
srs = df1.x
srs

0    1
1    2
2    3
Name: x, dtype: int64
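A Series does not have to come from an existing dataframe. The sketch below builds one directly with pd.Series and also selects a column with bracket notation, which behaves like the attribute access used above. The names `heights` and `col_x` and the values are made up for illustration.

```python
import pandas as pd

# Build a Series directly from a list; `index` and `name` are optional.
heights = pd.Series([1.62, 1.75, 1.80], index=['ann', 'bob', 'cem'], name='height_m')

# Selecting a single column with bracket notation also returns a Series.
df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]})
col_x = df1['x']

print(type(col_x).__name__)  # Series
print(heights['bob'])        # 1.75
```

Bracket notation is usually the safer choice, because attribute access fails for column names that contain spaces or that clash with an existing DataFrame method.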

If we create the dataframe with the syntax above, the row labels (the index) are automatically set to 0, 1, 2 and so on. However, we can pass "index =" if we want to specify the index labels ourselves. So, instead, we can create a dataframe as follows:

In [6]:
df2 = pd.DataFrame({'x':[10,20,30], 'y':[40,50,60], 'z':[70,80,90]}, index = ['a','b','c'])
df2

Unnamed: 0,x,y,z
a,10,40,70
b,20,50,80
c,30,60,90


Finally, we can create dataframes with a multi-level (hierarchical) index by passing the result of pd.MultiIndex.from_tuples to the "index =" argument:

In [7]:
df3 = pd.DataFrame({'x':[1,2,3], 'y':[4,5,6], 'z':[7,8,9]}, index = pd.MultiIndex.from_tuples([('a','b'),('a','c'),('d','c')], names = ['m','r']))
df3

m,r,x,y,z
a,b,1,4,7
a,c,2,5,8
d,c,3,6,9
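Rows of a dataframe with a multi-level index can be selected by one level or by the full tuple of labels. The following is a small sketch on the same df3; the variable names `rows_a`, `row_ac` and `rows_c` are only illustrative.

```python
import pandas as pd

df3 = pd.DataFrame(
    {'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]},
    index=pd.MultiIndex.from_tuples(
        [('a', 'b'), ('a', 'c'), ('d', 'c')], names=['m', 'r']
    ),
)

# All rows whose outer level 'm' equals 'a'; the result is indexed by 'r'.
rows_a = df3.loc['a']

# One exact row, addressed by the full (m, r) tuple; the result is a Series.
row_ac = df3.loc[('a', 'c')]

# All rows whose inner level 'r' equals 'c', regardless of 'm'.
rows_c = df3.xs('c', level='r')

print(rows_a)
print(row_ac)
print(rows_c)
```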


<h3 style="margin-bottom:0">4) Subsetting DataFrames </h3>

<p style="margin-top:0; text-align: justify">In this section, we will pick a dataframe and have a look at different ways to query data. Libraries such as scikit-learn and seaborn come with a lot of useful built-in datasets, and we will be using one of them. CSV versions of the seaborn datasets, which Pandas can read directly, are available online. You can check this <a href="https://github.com/mwaskom/seaborn-data">GitHub link.</a></p>

In [8]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/diamonds.csv')

<p style="margin-top:0; text-align: justify"> The Diamonds dataset contains the prices and other attributes of almost 54,000 diamonds. It is a good dataset for beginners who are learning to work with data. The source of this data is <a href="https://www.kaggle.com/shivam2503/diamonds">here.</a></p>

<p style="margin-top:0; text-align: justify"> We can start inspecting our dataframe by looking at a few rows. It is good practice to do this before any analysis or modification, so that loading errors are caught early. We can check the first n rows by simply adding ".head(n)" </p>

In [9]:
df.head(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


We can also check the last n rows by adding ".tail(n)". Since the default index starts at 0, the last index label also tells us how many rows we have.

In [10]:
df.tail(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.5
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.7,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
53939,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64


We can check the shape of our dataframe with ".shape". It returns a tuple of (rows, columns).

In [11]:
df.shape

(53940, 10)

We can pick a random sample of n rows by adding ".sample(n)". We can also sample a fraction of the rows, for example with ".sample(frac=0.01)". We generally add "random_state = x" as well. This acts like a seed: each time we run the code, we will consistently get the same result.

In [12]:
df.sample(10, random_state=123)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
32685,0.31,Premium,G,VS1,63.0,57.0,802,4.32,4.28,2.71
36258,0.46,Ideal,E,SI2,61.3,55.0,935,4.98,5.0,3.06
14429,1.21,Very Good,H,SI1,62.4,58.0,5826,6.83,6.79,4.25
36250,0.41,Ideal,E,VS2,62.1,57.0,935,4.74,4.79,2.96
460,0.9,Ideal,J,VS2,62.8,55.0,2817,6.2,6.16,3.88
34249,0.38,Ideal,G,SI1,62.4,57.0,855,4.65,4.62,2.89
665,0.73,Ideal,E,SI1,59.1,59.0,2846,5.92,5.95,3.51
36097,0.32,Premium,F,VVS1,61.3,58.0,926,4.37,4.4,2.69
26430,2.0,Good,E,SI2,60.1,54.0,15962,8.01,8.15,4.86
13170,1.06,Very Good,E,SI1,63.7,55.0,5445,6.45,6.49,4.12
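The fraction form and the effect of the seed can be checked on a small example. This is only a sketch on a made-up dataframe called `toy`, standing in for the diamonds data.

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds dataframe.
toy = pd.DataFrame({'id': range(100), 'value': [i * 0.5 for i in range(100)]})

# Draw 10% of the rows; the same random_state gives the same rows each time.
first = toy.sample(frac=0.1, random_state=123)
second = toy.sample(frac=0.1, random_state=123)

print(len(first))            # 10
print(first.equals(second))  # True
```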


If we want to select rows by their integer position, we can use ".iloc[start:stop]". Note that positions start from 0. Also, the start position is inclusive and the stop position is exclusive.

In [13]:
df.iloc[10:20]

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
10,0.3,Good,J,SI1,64.0,55.0,339,4.25,4.28,2.73
11,0.23,Ideal,J,VS1,62.8,56.0,340,3.93,3.9,2.46
12,0.22,Premium,F,SI1,60.4,61.0,342,3.88,3.84,2.33
13,0.31,Ideal,J,SI2,62.2,54.0,344,4.35,4.37,2.71
14,0.2,Premium,E,SI2,60.2,62.0,345,3.79,3.75,2.27
15,0.32,Premium,E,I1,60.9,58.0,345,4.38,4.42,2.68
16,0.3,Ideal,I,SI2,62.0,54.0,348,4.31,4.34,2.68
17,0.3,Good,J,SI1,63.4,54.0,351,4.23,4.29,2.7
18,0.3,Good,J,SI1,63.8,56.0,351,4.23,4.26,2.71
19,0.3,Very Good,J,SI1,62.7,59.0,351,4.21,4.27,2.66
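Position-based slicing with ".iloc" behaves differently from label-based slicing with ".loc", which includes both end labels. A minimal sketch on a made-up dataframe with letter labels makes the difference visible:

```python
import pandas as pd

# Hypothetical toy data with string labels so positions and labels differ.
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50]},
                   index=['a', 'b', 'c', 'd', 'e'])

by_position = toy.iloc[1:3]  # positions 1 and 2, i.e. rows 'b' and 'c'
by_label = toy.loc['b':'d']  # labels 'b' through 'd', end label included

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c', 'd']
```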


We can also check whether we have null values by calling ".isnull()", which returns True for every missing entry:

In [14]:
df.isnull().head(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False
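A table of True and False values is hard to read for a large dataframe, so the result is often summarised. The sketch below counts missing values per column on a made-up dataframe (`toy`) that contains two gaps on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({'carat': [0.23, np.nan, 0.31],
                    'cut': ['Ideal', 'Good', None]})

# True counts as 1, so summing the boolean table counts the nulls per column.
missing_per_column = toy.isnull().sum()

# A single yes/no answer for the whole dataframe.
any_missing = toy.isnull().values.any()

print(missing_per_column)
print(any_missing)  # True
```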


We can select and order the top n entries based on an attribute. To retrieve the 5 diamonds with the smallest depth, we can use ".nsmallest()" (".nlargest()" works the same way for the largest values): 

In [15]:
df.nsmallest(5, 'depth')

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
4518,1.0,Fair,G,SI1,43.0,59.0,3634,6.32,6.27,3.97
10377,1.09,Ideal,J,VS2,43.0,54.0,4778,6.53,6.55,4.12
6341,1.0,Fair,G,VS2,44.0,53.0,4032,6.31,6.24,4.12
16857,1.43,Fair,I,VS1,50.8,60.0,6727,7.73,7.25,3.93
36503,0.3,Fair,E,VVS2,51.0,67.0,945,4.67,4.62,2.37


<p style="margin-top:0; text-align: justify"> We can also subset columns. We can select several columns (carat, cut and price) by passing a list of column names:  </p>

In [16]:
df[['carat','cut','price']].head(5)

Unnamed: 0,carat,cut,price
0,0.23,Ideal,326
1,0.21,Premium,326
2,0.23,Good,327
3,0.29,Premium,334
4,0.31,Good,335


Also, we can retrieve them by their integer positions (the first colon means "take all rows", and the list after the comma gives the column positions):

In [17]:
df.iloc[:,[0,1,6]].head(5)

Unnamed: 0,carat,cut,price
0,0.23,Ideal,326
1,0.21,Premium,326
2,0.23,Good,327
3,0.29,Premium,334
4,0.31,Good,335


Finally, we can select rows that meet a logical condition. If we want to retrieve the rows where cut is 'Ideal', we can write:

In [18]:
df.loc[df['cut'] == 'Ideal']

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
11,0.23,Ideal,J,VS1,62.8,56.0,340,3.93,3.90,2.46
13,0.31,Ideal,J,SI2,62.2,54.0,344,4.35,4.37,2.71
16,0.30,Ideal,I,SI2,62.0,54.0,348,4.31,4.34,2.68
39,0.33,Ideal,I,SI2,61.8,55.0,403,4.49,4.51,2.78
...,...,...,...,...,...,...,...,...,...,...
53925,0.79,Ideal,I,SI1,61.6,56.0,2756,5.95,5.97,3.67
53926,0.71,Ideal,E,SI1,61.9,56.0,2756,5.71,5.73,3.54
53929,0.71,Ideal,G,VS1,61.4,56.0,2756,5.76,5.73,3.53
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
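Several conditions can be combined in the same way. The following sketch uses a small made-up dataframe (`toy`) with a few diamonds-like columns; note that each condition needs its own parentheses and that `&` (and) and `|` (or) are used instead of the Python keywords.

```python
import pandas as pd

# Hypothetical toy data with a few diamonds-like columns.
toy = pd.DataFrame({
    'cut': ['Ideal', 'Premium', 'Good', 'Ideal'],
    'carat': [0.23, 0.21, 0.23, 0.31],
    'price': [326, 326, 327, 344],
})

# Rows where cut is 'Ideal' AND price is above 330.
ideal_and_pricey = toy[(toy['cut'] == 'Ideal') & (toy['price'] > 330)]

# Rows where cut is one of several values.
ideal_or_premium = toy[toy['cut'].isin(['Ideal', 'Premium'])]

print(ideal_and_pricey)
print(ideal_or_premium)
```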


<h3 style="margin-bottom:0">5) Reshaping Data </h3>

<p style="margin-top:0; text-align: justify">In this section, we will look at how we can reshape data.</p>

We can gather columns into rows, turning a wide table into a long one, with pd.melt:

In [19]:
pd.melt(df1)

Unnamed: 0,variable,value
0,x,1
1,x,2
2,x,3
3,y,4
4,y,5
5,y,6
6,z,7
7,z,8
8,z,9
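In practice, one or more identifier columns are usually kept as they are while the remaining columns are melted. The sketch below assumes a made-up dataframe `wide` with an 'id' column, and uses ".pivot()" to go back from the long form to the wide form.

```python
import pandas as pd

# Hypothetical wide table: one row per id, one column per measurement.
wide = pd.DataFrame({'id': ['p', 'q'], 'x': [1, 2], 'y': [4, 5]})

# Keep 'id' as it is and melt the other columns into (variable, value) pairs.
long = pd.melt(wide, id_vars='id', var_name='variable', value_name='value')

# Reverse the operation: one column per distinct entry of 'variable'.
back = long.pivot(index='id', columns='variable', values='value')

print(long)
print(back)
```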


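We can also concatenate two dataframes with pd.concat. The default is axis=0, which stacks the dataframes on top of each other (row-wise), as in the first example below. If you pass axis=1, as in the second example, the dataframes are placed side by side (column-wise) and aligned on their index labels; labels that exist in only one dataframe produce missing values (NaN).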

In [20]:
pd.concat([df1,df2])

Unnamed: 0,x,y,z
0,1,4,7
1,2,5,8
2,3,6,9
a,10,40,70
b,20,50,80
c,30,60,90


In [21]:
pd.concat([df1,df2], axis=1)

Unnamed: 0,x,y,z,x,y,z
0,1.0,4.0,7.0,,,
1,2.0,5.0,8.0,,,
2,3.0,6.0,9.0,,,
a,,,,10.0,40.0,70.0
b,,,,20.0,50.0,80.0
c,,,,30.0,60.0,90.0


This is a general rule in pandas: axis = 0 makes a function work along the rows (the index) and axis = 1 makes it work along the columns.
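The axis convention is easiest to see with an aggregation. As a small sketch on df1, summing with each axis gives:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]})

# axis=0 moves down the rows, so we get one total per column.
col_totals = df1.sum(axis=0)

# axis=1 moves across the columns, so we get one total per row.
row_totals = df1.sum(axis=1)

print(col_totals.tolist())  # [6, 15, 24]
print(row_totals.tolist())  # [12, 15, 18]
```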

We can sort dataframes by the values of a column. Let's sort our diamonds dataframe (df) by carat. The default order is ascending; if you pass ascending=False, you get the results in descending order.

In [22]:
df.sort_values('carat', ascending=False)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
27415,5.01,Fair,J,I1,65.5,59.0,18018,10.74,10.54,6.98
27630,4.50,Fair,J,I1,65.8,58.0,18531,10.23,10.16,6.72
27130,4.13,Fair,H,I1,64.8,61.0,17329,10.00,9.85,6.43
25999,4.01,Premium,J,I1,62.5,62.0,15223,10.02,9.94,6.24
25998,4.01,Premium,I,I1,61.0,61.0,15223,10.14,10.10,6.17
...,...,...,...,...,...,...,...,...,...,...
31592,0.20,Premium,E,VS2,59.0,60.0,367,3.81,3.78,2.24
31591,0.20,Premium,E,VS2,59.8,62.0,367,3.79,3.77,2.26
31601,0.20,Premium,D,VS2,61.7,60.0,367,3.77,3.72,2.31
14,0.20,Premium,E,SI2,60.2,62.0,345,3.79,3.75,2.27
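Sorting by more than one column works the same way, with a list of column names and, optionally, one ascending flag per column. A minimal sketch on a made-up dataframe (`toy`):

```python
import pandas as pd

# Hypothetical toy data with two diamonds-like columns.
toy = pd.DataFrame({'cut': ['Good', 'Ideal', 'Good', 'Ideal'],
                    'carat': [0.3, 0.2, 0.5, 0.4]})

# Sort by cut (A to Z) and, within each cut, by carat from largest to smallest.
ordered = toy.sort_values(['cut', 'carat'], ascending=[True, False])

print(ordered.index.tolist())  # [2, 0, 3, 1]
```

The original index labels travel with their rows; adding ".reset_index(drop=True)" renumbers them from 0 if that is preferred.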


Finally, we can add new columns. If we want a new column that holds the product of the 'x', 'y' and 'z' values, we can create it by:

In [23]:
df['multiplication'] = df.x * df.y * df.z
df.head(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,multiplication
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43,38.20203
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31,34.505856
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31,38.076885
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63,46.72458
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75,51.91725


<h3 style="margin-bottom:0">6) Grouping Data </h3>

Pandas can group data with ".groupby()", which is very useful for getting insight from our data. For example, if we want to look at the average price of each kind of diamond cut, we can write:

In [30]:
df.groupby('cut')['price'].mean()

cut
Fair         4358.757764
Good         3928.864452
Ideal        3457.541970
Premium      4584.257704
Very Good    3981.759891
Name: price, dtype: float64

Here we said: first, group all rows by their 'cut' value, then calculate the mean price of each group. Please note that a mean cannot be calculated for 'color' or 'clarity', because they are categorical columns and have no numeric average.
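More than one statistic can be computed per group with ".agg()", and categorical columns can still be summarised with counts instead of means. The sketch below uses a small made-up dataframe (`toy`) rather than the full diamonds data.

```python
import pandas as pd

# Hypothetical toy data with one numeric and two categorical columns.
toy = pd.DataFrame({
    'cut': ['Ideal', 'Ideal', 'Good', 'Good'],
    'color': ['E', 'J', 'E', 'J'],
    'price': [300, 500, 200, 400],
})

# Several statistics of the numeric column for each cut.
summary = toy.groupby('cut')['price'].agg(['mean', 'min', 'max', 'count'])

# For a categorical column, count the distinct values in each group instead.
distinct_colors = toy.groupby('cut')['color'].nunique()

print(summary)
print(distinct_colors)
```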

<h3 style="margin-bottom:0">7) Pandas with Lambda Functions </h3>

A lambda function is a small function containing a single expression. These are very helpful when we have to perform small tasks with less code. We can combine lambda functions with Pandas library.

For example, assume that due to inflation, the price of each diamond goes up by 10%. We can write a simple lambda expression to modify our dataframe, such as:

In [None]:
df['price'] = df['price'].apply(lambda x: (x*1.1))
df.head(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,multiplication
0,0.23,Ideal,E,SI2,61.5,55.0,358.6,3.95,3.98,2.43,38.20203
1,0.21,Premium,E,SI1,59.8,61.0,358.6,3.89,3.84,2.31,34.505856
2,0.23,Good,E,VS1,56.9,65.0,359.7,4.05,4.07,2.31,38.076885
3,0.29,Premium,I,VS2,62.4,58.0,367.4,4.2,4.23,2.63,46.72458
4,0.31,Good,J,SI2,63.3,58.0,368.5,4.34,4.35,2.75,51.91725


Here, we basically said: for each element x of df['price'], replace x with x*1.1.
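For simple arithmetic like this, the same result can be obtained without a lambda, because Pandas applies arithmetic operators to a whole column at once. The sketch below compares the two forms on a short made-up price Series.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only to compare the two approaches.
prices = pd.Series([326, 326, 327, 334, 335], name='price')

with_apply = prices.apply(lambda x: x * 1.1)  # element by element
vectorized = prices * 1.1                     # whole column at once

print(np.allclose(with_apply, vectorized))  # True
```

The vectorized form is usually shorter and faster on large columns, while ".apply()" with a lambda remains handy when the logic does not map onto a built-in operation.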

We can also add conditions to our expressions. For example, assume that the government imposes limits on the price: it cannot be lower than 400 and it cannot be higher than 15000. To change the prices accordingly, we can nest conditional expressions such as:

In [None]:
df['price'] = df['price'].apply(lambda x: 400 if x<400 else (15000 if x>15000 else x))
df.head(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z,multiplication
0,0.23,Ideal,E,SI2,61.5,55.0,400.0,3.95,3.98,2.43,38.20203
1,0.21,Premium,E,SI1,59.8,61.0,400.0,3.89,3.84,2.31,34.505856
2,0.23,Good,E,VS1,56.9,65.0,400.0,4.05,4.07,2.31,38.076885
3,0.29,Premium,I,VS2,62.4,58.0,400.0,4.2,4.23,2.63,46.72458
4,0.31,Good,J,SI2,63.3,58.0,400.0,4.34,4.35,2.75,51.91725
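Limiting values to a range is common enough that Pandas has a built-in method for it, ".clip()". As a sketch on a short made-up price Series, the nested condition above could also be written as:

```python
import pandas as pd

# Hypothetical prices: one below the lower limit, one inside, one above.
prices = pd.Series([358.6, 2757.0, 18018.0], name='price')

# Values below 400 become 400, values above 15000 become 15000.
clamped = prices.clip(lower=400, upper=15000)

print(clamped.tolist())  # [400.0, 2757.0, 15000.0]
```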


<h3 style="margin-bottom:0">8) Input and Output</h3>
<br>
<p style="margin-top:0; text-align: justify">In this section, we will present how to save and load Pandas dataframes.</p>
<br>
<p style="margin-top:0; text-align: justify">For example, you can save your dataframe as a CSV file by putting the desired directory in place of "..." and giving the file a proper name:</p>

In [None]:
# to_csv is a method of the dataframe; index=False leaves the row labels out of the file
df.to_csv('.../new_file.csv', index=False)

You can load the CSV file back into a Pandas dataframe by:

In [None]:
pd.read_csv('.../new_file.csv')
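A complete save and load round trip can be sketched as follows. The temporary directory and the file name `new_file.csv` are chosen only so that the example runs anywhere; in your own work you would use a real path.

```python
import os
import tempfile

import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'new_file.csv')  # hypothetical file name
    df1.to_csv(path, index=False)  # index=False: do not write the row labels
    loaded = pd.read_csv(path)

print(loaded.equals(df1))  # True
```

Without index=False, the row labels are written as an extra first column, which then shows up as a column named "Unnamed: 0" when the file is read back.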

<h3 style="margin-bottom:0">9) Help, References & Useful Links</h3>

You can look at how a function works directly in Jupyter Notebook. For example, if you do not understand how 'iloc' works, you can simply write:

In [None]:
help(pd.Series.iloc)

Help on property:

    Purely integer-location based indexing for selection by position.
    
    ``.iloc[]`` is primarily integer position based (from ``0`` to
    ``length-1`` of the axis), but may also be used with a boolean
    array.
    
    Allowed inputs are:
    
    - An integer, e.g. ``5``.
    - A list or array of integers, e.g. ``[4, 3, 0]``.
    - A slice object with ints, e.g. ``1:7``.
    - A boolean array.
    - A ``callable`` function with one argument (the calling Series or
      DataFrame) and that returns valid output for indexing (one of the above).
      This is useful in method chains, when you don't have a reference to the
      calling object, but would like to base your selection on some value.
    
    ``.iloc`` will raise ``IndexError`` if a requested indexer is
    out-of-bounds, except *slice* indexers which allow out-of-bounds
    indexing (this conforms with python/numpy *slice* semantics).
    
    See more at :ref:`Selection by Position <indexing.inte

<hr style="height:2px;color:navy;margin-top:0">
<p style="margin-top:0; text-align: justify">This tutorial was prepared with the help of the <a href="https://pandas.pydata.org/docs/user_guide/index.html#user-guide">official Pandas documentation</a>.</p>
<br>
<p style="margin-top:0; text-align: justify">You can always refer to this documentation, as it is comprehensive. If you cannot find a solution there, you are very likely to find an answer to your question on the internet, as this library is very widely used. If you still cannot find an answer, please do not hesitate to ask the course assistants.</p>

<h4 style="margin-bottom:0">Useful Links</h4>

<p style="margin-top:0; text-align: justify">Here are some useful tutorials or blogs that are related to Pandas:</p>

<p style="margin-top:0; text-align: justify"><a href="https://pandas.pydata.org/docs/user_guide/index.html#user-guide">Official Website:</a> This is the website of Pandas. You may find tons of useful material that covers each aspect of the Pandas library.</p> 

<p style="margin-top:0; text-align: justify"><a href="https://www.youtube.com/watch?v=QUT1VHiLmmI">Complete Python Pandas Data Science Tutorial:</a> This is a YouTube video provided by Keith Galli. It covers the fundamentals of Pandas.</p> 

<p style="margin-top:0; text-align: justify"><a href="https://pbpython.com/text-cleaning.html">Efficiently Cleaning Text with Pandas:</a> This article is a good example of how you can do data cleaning with Pandas. It is definitely worth checking out.</p> 

<p style="margin-top:0; text-align: justify"><a href="https://www.analyticsvidhya.com/blog/2020/03/understanding-transform-function-python/">Learn How to use the Transform Function in Pandas:</a> This blog post is a good example of how you can use the transform and apply functions within pandas.</p> 