<a id="top"></a>
# Pandas
Pandas is a Python package that makes importing and analyzing data much easier. <br>
Its core component is the DataFrame, a two-dimensional labeled object that stores data in tabular form. <br>
The DataFrame is the primary pandas data structure. <br>
It can be thought of as a dict-like container for Series objects.

Contents:
* <a href="#dataframe">Creation of a Dataframe - pd.DataFrame()</a>
* <a href="#head">df.head()</a>
* <a href="#csv">Dealing with csv files</a>
* <a href="#iloc">df.iloc[]</a>
* <a href="#sort">Sort a dataframe based on columns</a>
* <a href="#numpy">Conversion of dataframe to numpy array</a>

In [1]:
import numpy as np
import pandas as pd
import sys

In [2]:
user_data = {
    "MarksA": np.random.randint(1,100,5),
    "MarksB": np.random.randint(50,100,5),
    "MarksC": np.random.randint(75,100,5)
}
print(user_data)

{'MarksA': array([ 6, 87, 47, 93, 57]), 'MarksB': array([96, 71, 78, 67, 94]), 'MarksC': array([86, 90, 93, 76, 78])}
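
As a rough illustration of the dict-like view described in the introduction, the sketch below builds a frame from a dict shaped like `user_data`. Fixed values replace the random ones so the result is reproducible, and the name `marks` is only illustrative.

```python
import numpy as np
import pandas as pd

# Fixed values stand in for the random marks generated above.
marks = pd.DataFrame({
    "MarksA": np.array([6, 87, 47, 93, 57]),
    "MarksB": np.array([96, 71, 78, 67, 94]),
})

# Selecting a key returns a Series, much like looking up a dict value.
col = marks["MarksA"]
print(type(col).__name__)     # Series
print(list(marks.keys()))     # ['MarksA', 'MarksB']
print("MarksB" in marks)      # True
```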


<a id="dataframe"></a>
### pandas.DataFrame(data=None, index=None, columns=None, dtype=None)
<b>Parameters</b> <br>
data: ndarray (structured or homogeneous), Iterable, dict, or DataFrame. <br>
       Dict can contain Series, arrays, constants, dataclass or list-like objects.

index: Index or array-like <br>
Index to use for the resulting frame. Defaults to RangeIndex if the input data carries no indexing information and no index is provided.

columns: Index or array-like <br>
Column labels to use for the resulting frame when the data does not have them, defaulting to RangeIndex(0, 1, 2, …, n).

dtype: dtype, default None <br>
Data type to force. Only a single dtype is allowed.
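
A minimal sketch of the three optional parameters working together. The row values and the labels `a`, `b`, `c`, `first`, `second` are made up for illustration.

```python
import pandas as pd

# Rows as nested lists; the labels below are made up for illustration.
rows = [[1, 2], [3, 4], [5, 6]]

scores = pd.DataFrame(
    rows,
    index=["a", "b", "c"],        # row labels instead of the default 0, 1, 2
    columns=["first", "second"],  # column labels instead of the default 0, 1
    dtype="float64",              # one dtype applied to every column
)
print(scores)
print(scores.dtypes)
```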

In [3]:
# creating dataframe from list
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']

df = pd.DataFrame(lst)
print(df)

        0
0   Geeks
1     For
2   Geeks
3      is
4  portal
5     for
6   Geeks


Since the index parameter isn't passed, the rows get the default index 0, 1, 2, 3, 4, 5, 6. <br>
Since the columns parameter isn't passed either, the single column gets the default label 0.

In [4]:
# Creating DataFrame from dict of ndarray/lists: 

try:
    data = {
        'Name':['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ekta'],
        'Age':[27, 24, 22, 32],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj','Panaji'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd','BTech']
    }
    
    # age has four values only
    df = pd.DataFrame(data)
    
except Exception as e:
    print(f"The error is {e}")
    print(sys.exc_info()[0])
    print(sys.exc_info()[1])

The error is arrays must all be same length
<class 'ValueError'>
arrays must all be same length


<b>NOTE</b>: To create a DataFrame from a dict of ndarrays/lists, all of the ndarrays/lists must have the same length.
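
The same-length rule applies to plain lists and arrays. When the dict values are Series instead, pandas aligns them on their index and fills the gaps with NaN. The sketch below shows this with a hypothetical frame named `people`.

```python
import pandas as pd

# 'Age' is one value short, but both entries are Series, not plain lists.
people = pd.DataFrame({
    "Name": pd.Series(["Jai", "Princi", "Gaurav"]),
    "Age": pd.Series([27, 24]),
})

# Series are aligned on their index, so the missing Age becomes NaN
# and the column is stored as float64 to hold it.
print(people)
print(people["Age"].isna().tolist())   # [False, False, True]
```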

In [5]:
# Creating DataFrame from dict of ndarray/lists: 
data = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj', 'Ekta'],
        'Age':[27, 29, 22, 32, 29],
        'Address':['Delhi', 'Kanpur', 'Allahabad', 'Kannauj','Panaji'],
        'Qualification':['Msc', 'MA', 'MCA', 'Phd','BTech']}
 
df = pd.DataFrame(data)
print(df)

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
4    Ekta   29     Panaji         BTech


<a id="head"></a>
### DataFrame.head(n=5)
Return the first n rows. <br>
This function returns the first n rows of the object based on position. It is useful for quickly checking whether your object holds the right type of data. <br>
For negative values of n, it returns all rows except the last |n| rows, equivalent to df[:n]. <br> <br>
<b>Parameters</b> <br>
n: int, default 5. Number of rows to return. <br>
<b>Returns</b> <br>
Same type as the caller: the first n rows of the caller object.

In [6]:
print(df.head())
print(type(df.head()))

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
4    Ekta   29     Panaji         BTech
<class 'pandas.core.frame.DataFrame'>


In [7]:
print(df.head(n=3))

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA


In [8]:
print(df.head(n=10))

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
4    Ekta   29     Panaji         BTech


In [9]:
print(df.head(-3))

     Name  Age Address Qualification
0     Jai   27   Delhi           Msc
1  Princi   29  Kanpur            MA
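
The behaviour for negative and oversized n can be checked directly against slicing. This is a small sketch on a throwaway frame named `frame`, not the frame used above.

```python
import pandas as pd

frame = pd.DataFrame({"x": [10, 20, 30, 40, 50]})

# Negative n drops rows from the end, the same as a negative slice stop.
assert frame.head(-3).equals(frame[:-3])
# n larger than the frame simply returns every row.
assert frame.head(10).equals(frame)
print(frame.head(-3)["x"].tolist())  # [10, 20]
```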


In [10]:
print(df.columns)
print(type(df.columns))

Index(['Name', 'Age', 'Address', 'Qualification'], dtype='object')
<class 'pandas.core.indexes.base.Index'>


In [11]:
# find the integer position of a column from its name
idx = df.columns.get_loc('Address')
print(idx)

2


In [12]:
print(df.describe())

             Age
count   5.000000
mean   27.800000
std     3.701351
min    22.000000
25%    27.000000
50%    29.000000
75%    29.000000
max    32.000000
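
Only `Age` appears above because `describe()` summarises numeric columns by default when the frame has mixed types. Passing `include="all"` is one way to bring the text columns in as well. The frame `people` below is a shortened stand-in for `df`.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj", "Ekta"],
    "Age": [27, 29, 22, 32, 29],
})

# By default only the numeric column is summarised.
print(list(people.describe().columns))    # ['Age']

# include="all" adds the text columns, with rows such as
# 'unique', 'top' and 'freq' alongside the numeric statistics.
summary = people.describe(include="all")
print(summary)
```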


In [13]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           5 non-null      object
 1   Age            5 non-null      int64 
 2   Address        5 non-null      object
 3   Qualification  5 non-null      object
dtypes: int64(1), object(3)
memory usage: 288.0+ bytes
None
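
The trailing `None` in the output above appears because `info()` prints its report itself and returns `None`, which `print` then displays. A short sketch that captures the report in a buffer instead, using a throwaway frame named `frame`:

```python
import io
import pandas as pd

frame = pd.DataFrame({"Age": [27, 29, 22]})

# info() writes its report to a buffer (stdout by default) and returns None.
buffer = io.StringIO()
result = frame.info(buf=buffer)
report = buffer.getvalue()

print(result is None)                      # True
print("RangeIndex: 3 entries" in report)   # True
```

Calling `df.info()` on its own, without `print`, avoids the extra line.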


<a id="csv"></a>
## Exporting to csv, Importing from csv
to_csv writes a DataFrame to a csv file. <br>
read_csv reads a csv file and returns a DataFrame.

In [14]:
# to create a csv from a dataframe
df.to_csv("register1.csv")
# read from a csv file
my_data = pd.read_csv("register1.csv")
print(type(my_data))
print(my_data)

<class 'pandas.core.frame.DataFrame'>
   Unnamed: 0    Name  Age    Address Qualification
0           0     Jai   27      Delhi           Msc
1           1  Princi   29     Kanpur            MA
2           2  Gaurav   22  Allahabad           MCA
3           3    Anuj   32    Kannauj           Phd
4           4    Ekta   29     Panaji         BTech
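
The `Unnamed: 0` column above is the original index, which `to_csv` writes as a first column without a header. One way to read it back as the index is `index_col=0`. The sketch below uses in-memory text rather than a file, with a small stand-in frame named `frame`.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Name": ["Jai", "Princi"], "Age": [27, 29]})

# With no path, to_csv returns the CSV text; the index is the first,
# unnamed column, which is why read_csv reports 'Unnamed: 0'.
csv_text = frame.to_csv()
print(csv_text)

# index_col=0 tells read_csv to use that first column as the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(restored.columns))   # ['Name', 'Age']
```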


In [15]:
df.to_csv("register2.csv", index=False)
my_data = pd.read_csv("register2.csv")
print(my_data)

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
4    Ekta   29     Panaji         BTech


In [16]:
df.to_csv("register3.csv", index=False, columns=['Name','Age'])
my_data = pd.read_csv("register3.csv")
print(my_data)

     Name  Age
0     Jai   27
1  Princi   29
2  Gaurav   22
3    Anuj   32
4    Ekta   29


<a id="iloc"></a>
### DataFrame.iloc
Purely integer-location based indexing for selection by position.
Allowed inputs are:
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable that takes the DataFrame and returns one of the above.

In [17]:
print(df)

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
4    Ekta   29     Panaji         BTech


#### Indexing just the rows

In [18]:
# With an integer.
print(df.iloc[0])
print(f"type is {type(df.iloc[0])}")

Name               Jai
Age                 27
Address          Delhi
Qualification      Msc
Name: 0, dtype: object
type is <class 'pandas.core.series.Series'>


In [19]:
print(df.iloc[2])

Name                Gaurav
Age                     22
Address          Allahabad
Qualification          MCA
Name: 2, dtype: object


In [20]:
print(df.iloc[[2]])
print(f"type is {type(df.iloc[[2]])}")

     Name  Age    Address Qualification
2  Gaurav   22  Allahabad           MCA
type is <class 'pandas.core.frame.DataFrame'>


In [21]:
# With a list of integers.
print(df.iloc[[0, 2]])

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
2  Gaurav   22  Allahabad           MCA


In [22]:
# With a slice object.
print(df.iloc[1:4])
print(type(df.iloc[1:4]))

     Name  Age    Address Qualification
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
<class 'pandas.core.frame.DataFrame'>


In [23]:
# With a boolean mask the same length as the index.
print(df.iloc[[True, False, False, True, False]])

   Name  Age  Address Qualification
0   Jai   27    Delhi           Msc
3  Anuj   32  Kannauj           Phd


In [24]:
# With a callable function. This selects the rows whose index is even.
print(df.iloc[lambda x: x.index % 2 == 0])

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
2  Gaurav   22  Allahabad           MCA
4    Ekta   29     Panaji         BTech


#### Indexing both axes

In [25]:
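# With integers: row at position 2 and column at position 1 (zero-based)
print(df.iloc[2,1])
# .iloc on the row Series keeps the second lookup positional as well
print(df.iloc[2].iloc[1])
print(type(df.iloc[2,1]))
print(type(df.iloc[2].iloc[1]))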

22
22
<class 'numpy.int64'>
<class 'numpy.int64'>


In [26]:
# With lists of integers: rows at positions 0 and 4, columns at positions 1 and 2
print(df.iloc[[0, 4], [1, 2]])
print(type(df.iloc[[0, 4], [1, 2]]))

   Age Address
0   27   Delhi
4   29  Panaji
<class 'pandas.core.frame.DataFrame'>


For either axis, you can also use a slice, a boolean list, or a callable.
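
A short sketch of those other input types on both axes, and of `get_loc` supplying the position that `iloc` expects. The frame `people` is a reduced copy of `df` built only for this example.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj", "Ekta"],
    "Age": [27, 29, 22, 32, 29],
    "Address": ["Delhi", "Kanpur", "Allahabad", "Kannauj", "Panaji"],
})

# Slices on both axes: rows at positions 1-2, columns at positions 0-1.
print(people.iloc[1:3, 0:2])

# Boolean list on rows, list of positions on columns.
print(people.iloc[[True, False, False, False, True], [0, 2]])

# Callable on rows (first and last position), integer on columns.
print(people.iloc[lambda d: [0, len(d) - 1], 0].tolist())   # ['Jai', 'Ekta']

# get_loc turns a column name into the integer position iloc expects.
age_pos = people.columns.get_loc("Age")
print(people.iloc[:, age_pos].tolist())   # [27, 29, 22, 32, 29]
```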

<a id="sort"></a>
### Sort a dataframe based on columns

In [27]:
print(df)

     Name  Age    Address Qualification
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
2  Gaurav   22  Allahabad           MCA
3    Anuj   32    Kannauj           Phd
4    Ekta   29     Panaji         BTech


In [28]:
print(df.sort_values(by=['Age'],ascending=True))

     Name  Age    Address Qualification
2  Gaurav   22  Allahabad           MCA
0     Jai   27      Delhi           Msc
1  Princi   29     Kanpur            MA
4    Ekta   29     Panaji         BTech
3    Anuj   32    Kannauj           Phd


In [29]:
# sort by Age first, then break ties by Name
print(df.sort_values(by=['Age','Name'],ascending=True))

     Name  Age    Address Qualification
2  Gaurav   22  Allahabad           MCA
0     Jai   27      Delhi           Msc
4    Ekta   29     Panaji         BTech
1  Princi   29     Kanpur            MA
3    Anuj   32    Kannauj           Phd
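
`ascending` also accepts one flag per column, and `sort_values` returns a new frame rather than changing the original. The sketch below illustrates both on a reduced copy of the data. The names `ranked` and `renumbered` are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj", "Ekta"],
    "Age": [27, 29, 22, 32, 29],
})

# Oldest first; within the same age, names in alphabetical order.
ranked = people.sort_values(by=["Age", "Name"], ascending=[False, True])
print(ranked)

# sort_values returns a new frame; ignore_index=True renumbers it 0..n-1.
renumbered = people.sort_values(by="Age", ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3, 4]
```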


<a id="numpy"></a>
### Conversion of dataframe to numpy array

In [30]:
data_array = df.values
print(type(data_array))
print(data_array.shape)

<class 'numpy.ndarray'>
(5, 4)


In [31]:
print(data_array)

[['Jai' 27 'Delhi' 'Msc']
 ['Princi' 29 'Kanpur' 'MA']
 ['Gaurav' 22 'Allahabad' 'MCA']
 ['Anuj' 32 'Kannauj' 'Phd']
 ['Ekta' 29 'Panaji' 'BTech']]
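
The pandas documentation recommends `to_numpy()` over `.values` for this conversion. The sketch below uses it on a shortened stand-in frame and shows why the array printed above mixes strings and integers: the columns must share a single dtype.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav"],
    "Age": [27, 29, 22],
})

# Mixed string and integer columns share one array, so the dtype is object.
mixed = people.to_numpy()
print(mixed.dtype)          # object

# A single numeric column keeps its numeric dtype.
ages = people["Age"].to_numpy()
print(ages.dtype.kind)      # i (a signed integer type)
print(ages.sum())           # 78
```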


<a href='#top'>Go Back</a>