# Introduction to Pandas in Python

Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures for efficiently storing and processing large datasets, and tools for working with data from a variety of sources. This notebook is an introduction to using Pandas in Python.

# Table of Contents


1. **[Importing Modules (Pandas)](#pandas)**
<br><br> 
2. **[Pandas DataFrame](#dataframes)**
<br><br>
3. **[Manipulating DataFrame](#manipulatingDF)**
<br><br>
4. **[Reading Data from Different Sources](#reading_data)**



<a id="pandas"> </a>
# 1. Pandas

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> Pandas contains data structures and data manipulation tools designed for data cleaning and analysis.
<br><br>
                       Pandas is designed for working with tabular data.<br><br>
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>

A module/library in Python is simply a way to organize code; it contains Python classes, functions, or both.

### Quick look at functions

In [1]:
# a function takes arguments (here x and y) and returns a result
def add(x, y):
    c = x + y
    return c
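
As a small illustration of how arguments are passed to a function like `add`, the sketch below calls it once with positional arguments and once with keyword arguments. The variable names `result_positional` and `result_keyword` are chosen here for illustration only.

```python
def add(x, y):
    """Return the sum of x and y."""
    return x + y

# Positional arguments are matched to parameters by order.
result_positional = add(2, 3)

# Keyword arguments are matched by parameter name, so order does not matter.
result_keyword = add(y=3, x=2)

print(result_positional, result_keyword)  # 5 5
```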

In [2]:
# Note for instructors: explain functions and how importing works.
# To view the source code of a function:
# import inspect
# print(inspect.getsource(add))

**How to install and import pandas?**<br>
1. Install pandas:<br><br>
`!pip install pandas`<br><br>
2. Import pandas:<br><br>
`import pandas as pd`

In [3]:
#Check the list of base packages
!pip list

Package                       Version
----------------------------- --------------------
alabaster                     0.7.12
anaconda-client               1.11.0
anaconda-navigator            2.3.1
anaconda-project              0.11.1
anyio                         3.5.0
appdirs                       1.4.4
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
arrow                         1.2.2
astroid                       2.11.7
astropy                       5.1
atomicwrites                  1.4.0
attrs                         21.4.0
Automat                       20.2.0
autopep8                      1.6.0
Babel                         2.9.1
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
backports.tempfile            1.0
backports.weakref             1.0.post1
bcrypt                        3.2.0
beautifulsoup4                4.11.1
binaryornot                   0.4.4
bitarray                      2.5.1
bkcharts                      0.2
blac

In [4]:
# install pandas
# !pip install pandas

In [5]:
#import pandas 'library/package/modules'
import pandas as pd

`as` is a Python keyword that gives the imported module an alias. From now on we will write `pd.` instead of `pandas.`
 
<br>
<span style="color:crimson">Use well-maintained libraries whenever they are freely available. It saves time, and their code is already tested, debugged and optimized.</span>

<a id="dataframes"> </a>
# 2. Pandas DataFrames

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> A DataFrame is a tabular representation of data containing an ordered collection of columns, each of which can be a different type (numeric, string, boolean, and so on). <br><br>                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
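
To make the idea of columns with different types concrete, here is a minimal sketch that builds a DataFrame from a dictionary. The column names and values are made up for illustration and are not part of `example.csv`.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column.
people = pd.DataFrame(
    {
        "name": ["Asha", "Ben", "Chen"],    # strings
        "age": [45, 12, 54],                # integers
        "height_m": [1.35, 1.21, 1.50],     # floats
        "is_adult": [True, False, True],    # booleans
    }
)

print(people)
print(people.dtypes)  # one dtype per column
```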

## <span style="color:darkgreen;">To read data from a csv file</span>

In [6]:
# read the example.csv file in a dataframe
data = pd.read_csv('example.csv')
data.head(3)

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45,60,1.35
1,12,43,1.21
2,54,78,1.5


In [7]:
# check the type
type(data)

pandas.core.frame.DataFrame

On checking the type, we see that the data has been read into a pandas DataFrame.

## <span style="color:darkgreen;">To print top & bottom rows of the data</span>

In [8]:
#top rows
data.head(3)

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45,60,1.35
1,12,43,1.21
2,54,78,1.5


By default, `.head()` displays the **first** five rows. However, we can pass the desired number of rows as an argument, as in `data.head(3)`.

In [9]:
#bottom rows
data.tail(3)

Unnamed: 0,Age,Weight (in kg),Height (in m)
20,68,50,1.32
21,56,76,1.69
22,67,78,1.85


By default, `.tail()` displays the **last** five rows. However, we can pass the desired number of rows as an argument, as in `data.tail(3)`.
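
The default and explicit row counts can be compared side by side. This sketch uses a hypothetical ten-row DataFrame called `numbers` rather than the notebook's `data`.

```python
import pandas as pd

# Hypothetical single-column DataFrame with 10 rows (values 0..9).
numbers = pd.DataFrame({"value": range(10)})

default_head = numbers.head()   # no argument: first 5 rows
first_two = numbers.head(2)     # first 2 rows
last_three = numbers.tail(3)    # last 3 rows

print(len(default_head), len(first_two), len(last_three))  # 5 2 3
print(last_three["value"].tolist())                        # [7, 8, 9]
```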

## <span style="color:darkgreen;">To obtain the dimensions of the data</span>

In [10]:
data.shape

(23, 3)

## <span style="color:darkgreen;">To know the data types of a data frame</span>

In [11]:
data.dtypes

Age                 int64
Weight (in kg)      int64
Height (in m)     float64
dtype: object

We see the data type of each variable.

## <span style="color:darkgreen;">Print more information about the data</span>

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             23 non-null     int64  
 1   Weight (in kg)  23 non-null     int64  
 2   Height (in m)   23 non-null     float64
dtypes: float64(1), int64(2)
memory usage: 680.0 bytes


This output first gives the number of rows: `RangeIndex: 23 entries, 0 to 22` means there are 23 rows, numbered from 0 to 22. `Data columns (total 3 columns)` tells us there are three columns.

For each column, a line such as `Age 23 non-null int64` indicates that the column named 'Age' has 23 non-null observations of data type 'int64'.

Finally, the DataFrame takes up 680 bytes of memory.

In [13]:
# describe your data
data.describe()

Unnamed: 0,Age,Weight (in kg),Height (in m)
count,23.0,23.0,23.0
mean,44.521739,58.304348,1.528261
std,20.586557,19.401112,0.227309
min,10.0,21.0,1.21
25%,26.0,44.0,1.32
50%,54.0,65.0,1.52
75%,62.0,76.0,1.69
max,75.0,89.0,1.85


In [14]:
# we can even transpose it, for a better view
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age,23.0,44.521739,20.586557,10.0,26.0,54.0,62.0,75.0
Weight (in kg),23.0,58.304348,19.401112,21.0,44.0,65.0,76.0,89.0
Height (in m),23.0,1.528261,0.227309,1.21,1.32,1.52,1.69,1.85


# `.loc` and `.iloc` methods

In pandas, the `.loc` indexer selects rows and columns from a DataFrame by their labels. It also accepts a boolean array, in which case it returns the rows or columns where the array is True.

## <span style="color:darkgreen;">Indexing a dataframe using `.loc`</span>

`DataFrame.loc[]` is label-based, which means that you specify rows and columns by their row and column labels.
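
Label-based selection is easiest to see when the row labels are not numbers. The sketch below assumes a hypothetical DataFrame `scores` indexed by names; it is not part of the notebook's data.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels are names rather than integers.
scores = pd.DataFrame(
    {"math": [90, 72, 85], "art": [65, 88, 79]},
    index=["asha", "ben", "chen"],
)

one_cell = scores.loc["ben", "art"]               # a single value
one_row = scores.loc["chen"]                      # a whole row, as a Series
subset = scores.loc[["asha", "chen"], ["math"]]   # lists of labels -> DataFrame

print(one_cell)  # 88
print(one_row)
print(subset)
```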

In [15]:
#syntax
#dataframe.loc[rows, columns]

In [16]:
data.loc[0,]

Age               45.00
Weight (in kg)    60.00
Height (in m)      1.35
Name: 0, dtype: float64

In [17]:
data.loc[0,'Age']

45

## <span style="color:darkgreen;">Selecting multiple rows</span>

In [18]:
data.loc[[4,7,10]]

Unnamed: 0,Age,Weight (in kg),Height (in m)
4,68,50,1.32
7,57,34,1.61
10,23,53,1.5


We use two pairs of square brackets because we are passing a list of row labels: the outer pair belongs to `.loc`, and the inner pair creates the list.

## <span style="color:darkgreen;">Selecting a range of rows</span>

In [19]:
data.loc[12:17]

Unnamed: 0,Age,Weight (in kg),Height (in m)
12,55,89,1.65
13,23,45,1.75
14,56,76,1.69
15,67,78,1.85
16,26,65,1.21
17,56,74,1.69


## <span style="color:darkgreen;">Selecting the first column</span>

In [20]:
# data.loc[row_labels, column_labels]; a bare ':' selects everything
data.loc[:,'Age']

0     45
1     12
2     54
3     26
4     68
5     21
6     10
7     57
8     75
9     32
10    23
11    34
12    55
13    23
14    56
15    67
16    26
17    56
18    67
19    26
20    68
21    56
22    67
Name: Age, dtype: int64

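Negative numbers do not select columns from the end with `.loc`, because `.loc` would look for a column labelled -1. To select columns by position from the end, use `.iloc`: `data.iloc[:, -1]` selects the last column and `data.iloc[:, -2]` the second-to-last.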
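
As a small sketch of selecting columns from the end by position (this uses `.iloc`, since `.loc` expects labels), consider a hypothetical three-column DataFrame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"age": [45, 12], "weight": [60, 43], "height": [1.35, 1.21]})

last_col = toy.iloc[:, -1]          # last column by position
second_last_col = toy.iloc[:, -2]   # second-to-last column by position

print(last_col.name, second_last_col.name)  # height weight
```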

## <span style="color:darkgreen;">Select the first two columns</span>

In [21]:
data.loc[:,['Age','Weight (in kg)']]

Unnamed: 0,Age,Weight (in kg)
0,45,60
1,12,43
2,54,78
3,26,65
4,68,50
5,21,43
6,10,32
7,57,34
8,75,23
9,32,21


# `.iloc`


In pandas, the `.iloc` indexer selects rows and columns from a DataFrame by their integer position rather than by their labels.

## <span style="color:darkgreen;">Indexing a dataframe using `.iloc`</span>

`DataFrame.iloc[]` is integer position-based, so you have to specify rows and columns by their integer position values (0-based integer position).

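**Note:** In this DataFrame the row labels are the default integers 0 to 22, so a row's label and its position coincide. Slices still behave differently: `.loc[12:17]` includes row 17, while `.iloc[12:17]` stops at row 16.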
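
The difference between label slices and position slices can be checked on a tiny example. The DataFrame `toy` below is hypothetical and uses the default integer row labels.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})  # row labels 0..4

by_label = toy.loc[1:3]       # labels 1, 2 and 3 (end label included)
by_position = toy.iloc[1:3]   # positions 1 and 2 (end position excluded)

print(by_label["value"].tolist())     # [20, 30, 40]
print(by_position["value"].tolist())  # [20, 30]
```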

In [22]:
# using iloc select 'Age' & 'Weight (in kg)'
data.iloc[:,0:2]

Unnamed: 0,Age,Weight (in kg)
0,45,60
1,12,43
2,54,78
3,26,65
4,68,50
5,21,43
6,10,32
7,57,34
8,75,23
9,32,21


<a id="manipulatingDF"> </a>
# 3. Manipulating a Dataframe

<table align="left">
    <tr>
        <td>
            <div align="left", style="font-size:120%">
                <font color="#21618C">
                    <b> CAUTION:<br>
                        1. DataFrame[column] works for any column name, but DataFrame.column only works when the column name is a valid Python variable name.<br>
                        2. New columns cannot be created with the ` data.BMI ` syntax.
                    </b>
                </font>
            </div>
        </td>
    </tr>
</table>
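
Both cautions can be verified with a short sketch. The DataFrame `toy` and the column name `Ratio` are hypothetical; the warning pandas raises on attribute assignment is silenced here only to keep the output clean.

```python
import warnings
import pandas as pd

toy = pd.DataFrame({"Age": [45, 12], "Weight (in kg)": [60, 43]})

ages = toy.Age                     # works: 'Age' is a valid Python identifier
weights = toy["Weight (in kg)"]    # brackets required: the name contains spaces

# Attribute assignment sets a plain attribute; it does NOT create a column.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toy.Ratio = toy["Age"] / toy["Weight (in kg)"]
created_by_attribute = "Ratio" in toy.columns

# Bracket assignment creates the column.
toy["Ratio"] = toy["Age"] / toy["Weight (in kg)"]
created_by_brackets = "Ratio" in toy.columns

print(created_by_attribute, created_by_brackets)  # False True
```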

## <span style="color:darkgreen;">Adding a new column to the dataframe</span>

The terms columns, variables, and features all refer to the same thing.

In [23]:
# create a new column BMI, computed as weight / height**2
data['BMI'] = data['Weight (in kg)']/ data['Height (in m)']**2
data.head()

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45,60,1.35,32.921811
1,12,43,1.21,29.369579
2,54,78,1.5,34.666667
3,26,65,1.21,44.395875
4,68,50,1.32,28.696051


In [24]:
# check the shape of the data
data.shape

(23, 4)

## <span style="color:darkgreen;">Adding a new row to the dataframe</span>

In [25]:
data.loc[23] = [56, 76, 1.69, 26.609713]
data.head(3)

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579
2,54.0,78.0,1.5,34.666667


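A new row with label 23 has been added to the data. Because `data.head(3)` shows only the first three rows, use `data.tail()` to see it. Notice also that `Age` and `Weight (in kg)` are now displayed as floats, most likely because the assigned list mixes integers and floats.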
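
As an alternative sketch for appending a row, `pd.concat` can join the DataFrame with a one-row DataFrame. The names `toy`, `new_row` and `extended` are hypothetical; in this small example the integer columns keep an integer dtype.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [45, 12], "Weight": [60, 43]})

# Build the new row as a one-row DataFrame with matching column names.
new_row = pd.DataFrame([{"Age": 56, "Weight": 76}])

# ignore_index=True renumbers the rows 0..n-1 in the result.
extended = pd.concat([toy, new_row], ignore_index=True)

print(extended)
print(extended.dtypes)
```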

## <span style="color:darkgreen;">Sorting the dataframe</span>

In [26]:
# IPython/Jupyter only: `??` shows the documentation and source of an object
pd??

In [27]:
# sort the DataFrame by 'Age'; values are sorted in ascending order by default
# Note: ascending=False sorts the DataFrame in descending order
data.sort_values('Age', ascending=True)

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
6,10.0,32.0,1.65,11.753903
1,12.0,43.0,1.21,29.369579
5,21.0,43.0,1.52,18.611496
13,23.0,45.0,1.75,14.693878
10,23.0,53.0,1.5,23.555556
19,26.0,65.0,1.21,44.395875
3,26.0,65.0,1.21,44.395875
16,26.0,65.0,1.21,44.395875
9,32.0,21.0,1.52,9.089335
11,34.0,65.0,1.76,20.983988
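
Sorting can also use several columns at once, and it returns a new DataFrame rather than changing the original. The following is a minimal sketch on a hypothetical DataFrame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [26, 10, 26, 45], "Weight": [65, 32, 50, 60]})

# Sort by Age descending; break ties by Weight ascending.
ordered = toy.sort_values(["Age", "Weight"], ascending=[False, True])

print(ordered.index.tolist())  # [3, 2, 0, 1]
print(toy["Age"].tolist())     # original order unchanged: [26, 10, 26, 45]

# reset_index(drop=True) renumbers the rows after sorting.
renumbered = ordered.reset_index(drop=True)
print(renumbered)
```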


## <span style="color:darkgreen;">Dropping Rows and Columns</span>

In [28]:
# To drop a column
data1 = data.drop('BMI', axis=1)

In [29]:
# drop the row labelled 23 (returns a new DataFrame; `data` itself is unchanged)
data.drop(23)

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579
2,54.0,78.0,1.5,34.666667
3,26.0,65.0,1.21,44.395875
4,68.0,50.0,1.32,28.696051
5,21.0,43.0,1.52,18.611496
6,10.0,32.0,1.65,11.753903
7,57.0,34.0,1.61,13.116778
8,75.0,23.0,1.24,14.958377
9,32.0,21.0,1.52,9.089335
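
Because `drop` returns a new DataFrame, the original stays as it was unless the result is assigned back. A short sketch on a hypothetical DataFrame `toy`:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [45, 12, 54], "BMI": [32.9, 29.4, 34.7]})

without_row = toy.drop(1)               # drop the row labelled 1
without_col = toy.drop(columns="BMI")   # same as toy.drop("BMI", axis=1)

print(toy.shape)          # (3, 2): the original is untouched
print(without_row.shape)  # (2, 2)
print(without_col.shape)  # (3, 1)

# Assign the result back to keep the change.
toy = toy.drop(1)
```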


## <span style="color:darkgreen;">Dropping duplicates</span>

In [30]:
# Check if data has duplicates
duplicate_count = data.duplicated().sum()

In [31]:
# to drop duplicates from your data
data.drop_duplicates(inplace=True)
data

Unnamed: 0,Age,Weight (in kg),Height (in m),BMI
0,45.0,60.0,1.35,32.921811
1,12.0,43.0,1.21,29.369579
2,54.0,78.0,1.5,34.666667
3,26.0,65.0,1.21,44.395875
4,68.0,50.0,1.32,28.696051
5,21.0,43.0,1.52,18.611496
6,10.0,32.0,1.65,11.753903
7,57.0,34.0,1.61,13.116778
8,75.0,23.0,1.24,14.958377
9,32.0,21.0,1.52,9.089335
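
To see what counts as a duplicate, the sketch below flags repeated rows in a hypothetical DataFrame `toy` before removing them. By default the first occurrence of each row is kept.

```python
import pandas as pd

toy = pd.DataFrame({"Age": [26, 26, 45, 26], "Weight": [65, 65, 60, 65]})

flags = toy.duplicated()               # True for every repeat after the first
duplicate_count = int(flags.sum())     # number of duplicate rows

deduplicated = toy.drop_duplicates()   # returns a new DataFrame

print(flags.tolist())                # [False, True, False, True]
print(duplicate_count)               # 2
print(deduplicated.index.tolist())   # [0, 2]
```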


## <span style="color:darkgreen;">Checking for missing values</span>

Let's import a new dataset.

In [32]:
# Import missingdata.csv 
mdata = pd.read_csv('missingdata.csv')
mdata.head(3)

Unnamed: 0,Age,Weight (in kg),Height (in m)
0,45.0,60.0,1.35
1,12.0,43.0,1.21
2,54.0,78.0,1.5


In [33]:
# check for nulls
mdata.info()
#mdata.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             22 non-null     float64
 1   Weight (in kg)  21 non-null     float64
 2   Height (in m)   22 non-null     float64
dtypes: float64(3)
memory usage: 680.0 bytes


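The `.isnull()` method checks whether each value is missing, and `.sum()` then counts the 'True' values in each column, so `mdata.isnull().sum()` (commented out above) gives the number of missing values per column. The same information can be read from `.info()` by comparing each column's non-null count with the 23 total entries.

Here, we see there are two missing values in the 'Weight (in kg)' column and one missing value in each of the other two columns.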
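
The counting logic can be sketched on a small hypothetical DataFrame `toy` with a few gaps; `missingdata.csv` itself is not needed to run it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"Age": [45.0, np.nan, 54.0], "Weight": [60.0, np.nan, np.nan]}
)

missing_per_column = toy.isnull().sum()          # missing values in each column
total_missing = int(missing_per_column.sum())    # missing values overall
rows_with_gaps = toy[toy.isnull().any(axis=1)]   # rows with at least one gap

print(missing_per_column)
print(total_missing)                   # 3
print(rows_with_gaps.index.tolist())   # [1, 2]
```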

<a id="reading_data"> </a>
# 4. Reading Data from Different Sources

Note that the file names are used as examples only. You can try importing your own files to run the examples below.

**1. Read a `.xlsx` file**

`pd.read_excel('example.xlsx')`

In [34]:
df_xlsx = pd.read_excel('superstore.xlsx')
df_xlsx.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Postal Code,City,...,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit,Shipping Cost,Order Priority
0,40098,CA-2014-AB10015140-41954,2014-11-11,2014-11-13,First Class,AB-100151402,Aaron Bergman,Consumer,73120.0,Oklahoma City,...,TEC-PH-5816,Technology,Phones,Samsung Convoy 3,221.98,2,0.0,62.1544,40.77,High
1,26341,IN-2014-JR162107-41675,2014-02-05,2014-02-07,Second Class,JR-162107,Justin Ritter,Corporate,,Wollongong,...,FUR-CH-5379,Furniture,Chairs,"Novimex Executive Leather Armchair, Black",3709.395,9,0.1,-288.765,923.63,Critical
2,25330,IN-2014-CR127307-41929,2014-10-17,2014-10-18,First Class,CR-127307,Craig Reiter,Consumer,,Brisbane,...,TEC-PH-5356,Technology,Phones,"Nokia Smart Phone, with Caller ID",5175.171,9,0.1,919.971,915.49,Medium
3,13524,ES-2014-KM1637548-41667,2014-01-28,2014-01-30,First Class,KM-1637548,Katherine Murray,Home Office,,Berlin,...,TEC-PH-5267,Technology,Phones,"Motorola Smart Phone, Cordless",2892.51,5,0.1,-96.54,910.16,Medium
4,47221,SG-2014-RH9495111-41948,2014-11-05,2014-11-06,Same Day,RH-9495111,Rick Hansen,Consumer,,Dakar,...,TEC-CO-6011,Technology,Copiers,"Sharp Wireless Fax, High-Speed",2832.96,8,0.0,311.52,903.04,Critical


**2. Read a `.txt` file**

`data = pd.read_csv('example.txt', sep="\t")`

In [35]:
df_txt = pd.read_csv('python.txt')
df_txt

Unnamed: 0,Python is a Programming Language
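
The effect of `sep="\t"` is easier to see on text that actually contains tabs. This sketch reads a hypothetical tab-separated string from memory via `io.StringIO`, so no file is required.

```python
import io
import pandas as pd

# In-memory stand-in for a tab-separated text file (hypothetical content).
text = "name\tage\nAsha\t45\nBen\t12\n"

df_tab = pd.read_csv(io.StringIO(text), sep="\t")

print(df_tab)
print(df_tab.columns.tolist())  # ['name', 'age']
```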


**3. Read a `.zip` file**

`import zipfile
with zipfile.ZipFile('data.zip') as z:
    with z.open('example.csv') as f:
        file = pd.read_csv(f)
        print(file.head())`

In [36]:
import zipfile
with zipfile.ZipFile('example.zip') as z:
    with z.open('examplee.csv') as f:
        file = pd.read_csv(f)
        print(file.head())

   Sandeep        Sudesh     Kumar
0  Python   Programming   Language
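
As a possible shortcut, `pd.read_csv` can open a zip archive directly, assuming the archive contains exactly one file. The sketch below first builds such an archive in a temporary folder; the names `sample.zip` and `sample.csv` are hypothetical.

```python
import os
import tempfile
import zipfile
import pandas as pd

# Build a small archive in a temporary folder (hypothetical file names).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "sample.zip")
with zipfile.ZipFile(zip_path, "w") as z:
    z.writestr("sample.csv", "a,b\n1,2\n3,4\n")

# With a single file inside, read_csv can decompress and parse it in one step.
df_zip = pd.read_csv(zip_path, compression="zip")

print(df_zip)
```

When an archive holds several files, opening it with `zipfile` as shown in the cell above and choosing the member by name remains the safer route.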


**4. Read a `.html` file**

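`tables = pd.read_html('example.html', header=1, index_col=0)`

`pd.read_html` returns a list of DataFrames, one for each table found, so select the one you need by position, for example `tables[0]`.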

In [37]:
# read data from an HTML table
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
tables = pd.read_html(url)
df_html = tables[0]
df_html.head()

Unnamed: 0,Country / Area,UN continentalregion[4],UN statisticalsubregion[4],Population(1 July 2022),Population(1 July 2023),Change
0,China[a],Asia,Eastern Asia,1425887337,1425671352,−0.02%
1,India,Asia,Southern Asia,1417173173,1428627663,+0.81%
2,United States,Americas,Northern America,338289857,339996564,+0.50%
3,Indonesia,Asia,South-eastern Asia,275501339,277534123,+0.74%
4,Pakistan,Asia,Southern Asia,235824863,240485658,+1.98%


**5. Read a `.json` file**

`pd.read_json('example.json')`

In [38]:
df_json = pd.read_json('example.json')
df_json

Unnamed: 0,firstname,lastname,No,Year,address
permanent_address,Sandeep,Sudesh Kumar,6,1999,Chennai
temporary_address,Sandeep,Sudesh Kumar,6,1999,Stittsville
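
The layout of the JSON decides how rows and columns come out. As a minimal sketch, assuming a list of records (one object per row) rather than the structure of `example.json`, the hypothetical string below is read from memory:

```python
import io
import pandas as pd

# Hypothetical JSON: a list of records, one object per row.
json_text = '[{"city": "Chennai", "visits": 3}, {"city": "Stittsville", "visits": 5}]'

df_records = pd.read_json(io.StringIO(json_text))

print(df_records)
print(df_records.columns.tolist())  # ['city', 'visits']
```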


Congratulations on completing "Introduction to Pandas in Python"! Pandas is a powerful tool for data manipulation and analysis in Python. In this notebook, you have learned how to inspect and manipulate DataFrames, select data with `.loc` and `.iloc`, and read data from different sources. These skills are essential for anyone working with data in Python.

Remember to keep practicing and experimenting with different code examples, and don't be afraid to ask for help or consult online resources when you need it. Happy coding!

Sandeep Sudesh Kumar  <br>
sandeepsudesh06@gmail.com <br>
www.linkedin.com/in/sandeep-sudesh-kumar
