# Introduction to Pandas

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data. It is a fundamental high-level building block for practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational or statistical data set; the data need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** of data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

## Installing and Using Pandas

Installing Pandas requires NumPy; building the library from source additionally requires the appropriate tools to compile the C and Cython sources on which Pandas is built.
Details on this installation can be found in the [Pandas documentation](http://pandas.pydata.org/).

Once Pandas is installed, you can import it and check the version:

In [1]:
import pandas as pd
pd.__version__


'1.5.3'

# Introducing Pandas Objects

At the most basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.

We will start our code sessions with the standard NumPy and Pandas imports:

In [2]:
import pandas as pd
import numpy as np

### Series

A **Series** is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector.

In [3]:
series=pd.Series([10,30,60,376,283])
series

0     10
1     30
2     60
3    376
4    283
dtype: int64

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

In [6]:
s=pd.Series([10,30,60,376,283],index=['m','n','o','p','q'])

In [10]:
print(s)
print(s.values)
s.index

m     10
n     30
o     60
p    376
q    283
dtype: int64
[ 10  30  60 376 283]


Index(['m', 'n', 'o', 'p', 'q'], dtype='object')

We can assign meaningful labels to the index, if they are available:

In [12]:
marks=pd.Series([90,64,83,55,69],index=['Maths','Chemistry','Physics','Computer Science','English'])
marks

Maths               90
Chemistry           64
Physics             83
Computer Science    55
English             69
dtype: int64

These labels can be used to refer to the values in the `Series`.

In [13]:
marks['English']

69

Notice that the indexing operation preserves the association between the values and the corresponding indices.

We can still use positional indexing if we wish, via ``iloc``:

In [14]:
marks.iloc[3]

55

We can give both the array of values and the index meaningful labels themselves:

In [15]:
marks.name='MarksCard'
marks.index.name='PUC'
marks

PUC
Maths               90
Chemistry           64
Physics             83
Computer Science    55
English             69
Name: MarksCard, dtype: int64

NumPy's math functions and other operations can be applied to Series without losing the data structure.

In [16]:
np.log(marks)

PUC
Maths               4.499810
Chemistry           4.158883
Physics             4.418841
Computer Science    4.007333
English             4.234107
Name: MarksCard, dtype: float64

We can also filter according to the values in the `Series`:

In [18]:
marks[marks>=70]

PUC
Maths      90
Physics    83
Name: MarksCard, dtype: int64

A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [21]:
info_dict={'name':'Rahul','age':21,'lastname':'Sharma','Pno':8473939}
print(info_dict)
info=pd.Series(info_dict)
info

{'name': 'Rahul', 'age': 21, 'lastname': 'Sharma', 'Pno': 8473939}


name          Rahul
age              21
lastname     Sharma
Pno         8473939
dtype: object

### ``Series`` as generalized NumPy array

From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.

This explicit index definition gives the ``Series`` object additional capabilities.
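As a small illustration of those capabilities, the sketch below uses made-up values and labels (they are not taken from the examples above). It shows that the index can consist of strings, or of integers that are neither contiguous nor sorted, and that lookups go by label rather than by position.

```python
import pandas as pd

# String labels: values are looked up by name, not by position.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data['b'])

# The index need not be contiguous or sorted integers either.
sparse = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(sparse.loc[5])   # the element labeled 5, not the element at position 5
```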

### Series as specialized dictionary

In this way, you can think of a Pandas ``Series`` a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure that maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than a Python dictionary for certain operations.

The ``Series``-as-dictionary analogy is clearest when a ``Series`` object is constructed directly from a Python dictionary, as we did with ``info_dict`` above.
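The dictionary analogy can be sketched with a few dictionary-style operations. The data below is a shortened, illustrative version of the marks example; the variable names are assumptions made for this sketch.

```python
import pandas as pd

marks_dict = {'Maths': 90, 'Chemistry': 64, 'Physics': 83}
marks_series = pd.Series(marks_dict)

print('Physics' in marks_series)     # membership test on the labels, like a dict
print(list(marks_series.keys()))     # keys() returns the index
marks_series['English'] = 69         # assigning to a new label extends the Series

# Unlike a dict, a Series also supports array-style operations such as label slicing.
# Label slices include both endpoints.
print(marks_series['Maths':'Physics'])
```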

## DataFrame: bi-dimensional Series with two (or more) indices

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row index and a column index; it can be thought of as a dict of Series (all sharing the same index).
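The "dict of Series" view can be made concrete with a minimal sketch. The two Series below are made-up data with deliberately different labels, to show that the resulting rows share one common index.

```python
import pandas as pd

age = pd.Series({'Rahul': 21, 'Virat': 38, 'Sandeep': 26})
percentage = pd.Series({'Rahul': 94.0, 'Virat': 90.88, 'Kiran': 84.0})

# Each Series becomes a column; rows are aligned on the union of the two indexes,
# and cells with no source value are filled with NaN.
students = pd.DataFrame({'Age': age, 'percentage': percentage})
print(students)
print(students.index)     # row labels
print(students.columns)   # column labels
```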

DataFrames can be created in several ways. One of them is from a dict of lists.

In [30]:
details={'name':['Rahul','Virat','Sandeep','Veeru','Kiran'],'Age':[21,38,26,25,35],'percentage':[94,90.88,85.9,89,84]}
print(details)
info=pd.DataFrame(details)
info

{'name': ['Rahul', 'Virat', 'Sandeep', 'Veeru', 'Kiran'], 'Age': [21, 38, 26, 25, 35], 'percentage': [94, 90.88, 85.9, 89, 84]}


|   | name    | Age | percentage |
|---|---------|-----|------------|
| 0 | Rahul   | 21  | 94.0       |
| 1 | Virat   | 38  | 90.88      |
| 2 | Sandeep | 26  | 85.9       |
| 3 | Veeru   | 25  | 89.0       |
| 4 | Kiran   | 35  | 84.0       |


To change the order of the columns:

In [31]:
info=pd.DataFrame(details,columns=['Age','percentage','name'])
info

|   | Age | percentage | name    |
|---|-----|------------|---------|
| 0 | 21  | 94.0       | Rahul   |
| 1 | 38  | 90.88      | Virat   |
| 2 | 26  | 85.9       | Sandeep |
| 3 | 25  | 89.0       | Veeru   |
| 4 | 35  | 84.0       | Kiran   |


An `index` can be passed (as with `Series`), and passing column names that do not exist in the data will result in columns of missing values.
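A minimal sketch of both points follows. The row labels `s1`, `s2` and the `City` column are hypothetical names introduced only for this illustration.

```python
import pandas as pd

details = {'name': ['Rahul', 'Virat'], 'Age': [21, 38]}

# 'City' is a hypothetical column that is not a key of `details`,
# so pandas creates it and fills it with NaN.
frame = pd.DataFrame(details, columns=['name', 'Age', 'City'], index=['s1', 's2'])
print(frame)
```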

Assigning values to new columns is easy:

In [34]:
info['ratio']=info.Age/info.percentage
info

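|   | Age | percentage | name    | ratio    |
|---|-----|------------|---------|----------|
| 0 | 21  | 94.0       | Rahul   | 0.223404 |
| 1 | 38  | 90.88      | Virat   | 0.418134 |
| 2 | 26  | 85.9       | Sandeep | 0.302678 |
| 3 | 25  | 89.0       | Veeru   | 0.280899 |
| 4 | 35  | 84.0       | Kiran   | 0.416667 |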


In [38]:
info['RollNumber']=pd.Series(range(1001,1006),index=[0,1,2,3,4])
info

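|   | Age | percentage | name    | ratio    | RollNumber |
|---|-----|------------|---------|----------|------------|
| 0 | 21  | 94.0       | Rahul   | 0.223404 | 1001       |
| 1 | 38  | 90.88      | Virat   | 0.418134 | 1002       |
| 2 | 26  | 85.9       | Sandeep | 0.302678 | 1003       |
| 3 | 25  | 89.0       | Veeru   | 0.280899 | 1004       |
| 4 | 35  | 84.0       | Kiran   | 0.416667 | 1005       |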


In [110]:
info['Year']=pd.Series(np.arange(2016,2025.2),dtype=int)
info

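|   | Age | percentage | name    | ratio    | RollNumber | Year |
|---|-----|------------|---------|----------|------------|------|
| 0 | 21  | 94.0       | Rahul   | 0.223404 | 1001       | 2016 |
| 1 | 38  | 90.88      | Virat   | 0.418134 | 1002       | 2017 |
| 2 | 26  | 85.9       | Sandeep | 0.302678 | 1003       | 2018 |
| 3 | 25  | 89.0       | Veeru   | 0.280899 | 1004       | 2019 |
| 4 | 35  | 84.0       | Kiran   | 0.416667 | 1005       | 2020 |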


Passing a dict whose values are themselves dicts is also possible:

In [42]:
info.to_dict()

{'Age': {0: 21, 1: 38, 2: 26, 3: 25, 4: 35},
 'percentage': {0: 94.0, 1: 90.88, 2: 85.9, 3: 89.0, 4: 84.0},
 'name': {0: 'Rahul', 1: 'Virat', 2: 'Sandeep', 3: 'Veeru', 4: 'Kiran'},
 'ratio': {0: 0.22340425531914893,
  1: 0.41813380281690143,
  2: 0.3026775320139697,
  3: 0.2808988764044944,
  4: 0.4166666666666667},
 'RollNumber': {0: 1001, 1: 1002, 2: 1003, 3: 1004, 4: 1005},
 'Year': {0: 2016, 1: 2017, 2: 2018, 3: 2019, 4: 2020}}

In [45]:
info=pd.DataFrame(info.to_dict())
info

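|   | Age | percentage | name    | ratio    | RollNumber | Year |
|---|-----|------------|---------|----------|------------|------|
| 0 | 21  | 94.0       | Rahul   | 0.223404 | 1001       | 2016 |
| 1 | 38  | 90.88      | Virat   | 0.418134 | 1002       | 2017 |
| 2 | 26  | 85.9       | Sandeep | 0.302678 | 1003       | 2018 |
| 3 | 25  | 89.0       | Veeru   | 0.280899 | 1004       | 2019 |
| 4 | 35  | 84.0       | Kiran   | 0.416667 | 1005       | 2020 |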


### DataFrame as specialized dictionary

Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, ``info['Age']`` returns the ``Series`` object containing the ages we saw earlier.
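The dictionary-style behavior can be sketched as follows, using a shortened, illustrative copy of the student data so that the block runs on its own.

```python
import pandas as pd

info = pd.DataFrame({'name': ['Rahul', 'Virat', 'Sandeep'], 'Age': [21, 38, 26]})

ages = info['Age']            # dictionary-style lookup by column name
print(type(ages).__name__)    # the result is a Series
print('Age' in info)          # membership tests the column names, like dict keys
print(list(info.keys()))      # keys() returns the column labels
```

Note that the "keys" of a ``DataFrame`` are its column labels, not its row labels, which is the main way this analogy differs from the array view of the same object.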

#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:

In [49]:
number_list=[{'age':i*4,'marks':i*9}for i in range(5,10)]
print(number_list)
num=pd.DataFrame(number_list)
num

[{'age': 20, 'marks': 45}, {'age': 24, 'marks': 54}, {'age': 28, 'marks': 63}, {'age': 32, 'marks': 72}, {'age': 36, 'marks': 81}]


|   | age | marks |
|---|-----|-------|
| 0 | 20  | 45    |
| 1 | 24  | 54    |
| 2 | 28  | 63    |
| 3 | 32  | 72    |
| 4 | 36  | 81    |


Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:

In [51]:
pd.DataFrame([{'name':'Rahul','age':22},{'name':'Virat','percentage':86}])

|   | name  | age  | percentage |
|---|-------|------|------------|
| 0 | Rahul | 22.0 | NaN        |
| 1 | Virat | NaN  | 86.0       |


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:

In [55]:
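# np.arange(4,19).reshape(3,5) is a genuine 3x5 array holding the integers 4..18;
# a bare np.random.randint(4,20) returns one scalar, which would be repeated in every cell
pd.DataFrame(np.arange(4,19).reshape(3,5),index=['x','y','z'],columns=['column1','column2','column3','column4','column5'])

|   | column1 | column2 | column3 | column4 | column5 |
|---|---------|---------|---------|---------|---------|
| x | 4       | 5       | 6       | 7       | 8       |
| y | 9       | 10      | 11      | 12      | 13      |
| z | 14      | 15      | 16      | 17      | 18      |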


## The Pandas Index Object

We have seen here that both the ``Series`` and ``DataFrame`` objects contain an explicit *index* that lets you reference and modify data.
This ``Index`` object is an interesting structure in itself, and it can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as ``Index`` objects may contain repeated values).
Those views have some interesting consequences in the operations available on ``Index`` objects.
As a simple example, let's construct an ``Index`` from a list of integers:

In [58]:
num_index=pd.Index([100,201,303,474,263,383])
num_index

Int64Index([100, 201, 303, 474, 263, 383], dtype='int64')

### Index as immutable array

The ``Index`` in many ways operates like an array.
For example, we can use standard Python indexing notation to retrieve values or slices:

In [59]:
num_index[3]

474

In [62]:
print(num_index[:])
num_index[0:5:3]

Int64Index([100, 201, 303, 474, 263, 383], dtype='int64')


Int64Index([100, 474], dtype='int64')

``Index`` objects also have many of the attributes familiar from NumPy arrays:

In [64]:
print(num_index.dtype,num_index.shape,num_index.size,num_index.ndim)

int64 (6,) 6 1


One difference between ``Index`` objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means:

In [65]:
num_index[4]=200

TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple ``DataFrame``s and arrays, without the potential for side effects from inadvertent index modification.
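The other view mentioned above, the ``Index`` as an ordered set, can be sketched with the set-style methods that ``Index`` objects provide. The two indices below are made-up values for illustration.

```python
import pandas as pd

idx_a = pd.Index([1, 3, 5, 7, 9])
idx_b = pd.Index([2, 3, 5, 7, 11])

print(idx_a.intersection(idx_b))           # labels present in both
print(idx_a.union(idx_b))                  # labels present in either
print(idx_a.symmetric_difference(idx_b))   # labels present in exactly one
```

These method forms are used here rather than the ``&``, ``|`` and ``^`` operators, since the meaning of those operators on ``Index`` objects has changed between pandas versions.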

# Operating on Data in Pandas

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
Pandas inherits much of this functionality from NumPy, and the ufuncs are key to this.

Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.
This means that keeping the context of data and combining data from different sources (both potentially error-prone tasks with raw NumPy arrays) become far less error-prone with Pandas.
We will additionally see the well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures.

## Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.
Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:

In [73]:
randoms=np.random.RandomState(20)
rand=pd.Series(randoms.randint(0,15,4))
rand

0     3
1    10
2    12
3    10
dtype: int32

In [77]:
rmd=pd.DataFrame(randoms.randint(0,15,(6,5)),columns=['m','n','o','p','q'])
rmd

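|   | m  | n  | o  | p  | q  |
|---|----|----|----|----|----|
| 0 | 2  | 1  | 8  | 2  | 10 |
| 1 | 12 | 4  | 11 | 14 | 4  |
| 2 | 8  | 6  | 0  | 3  | 4  |
| 3 | 0  | 13 | 5  | 10 | 6  |
| 4 | 9  | 12 | 12 | 7  | 4  |
| 5 | 0  | 12 | 14 | 6  | 1  |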


If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*

In [79]:
np.exp(rand)

0        20.085537
1     22026.465795
2    162754.791419
3     22026.465795
dtype: float64

Or, for a slightly more complex calculation:

In [80]:
np.sin(rand*np.pi/2)

0   -1.000000e+00
1    6.123234e-16
2   -7.347881e-16
3    6.123234e-16
dtype: float64

## Universal Functions: Index Alignment

For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.
This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.

### Index alignment in Series

As an example, suppose we are combining two different data sources: two marks cards that cover overlapping, but not identical, sets of subjects:

In [89]:
markscard1=pd.Series({'Chemistry':89,'Maths':95,'Physics':95,'English':86,'Kannada':99,'Sankrit':80,})
markscard2=pd.Series({'computer_science':99,'Chemistry':82,'Social_science':84,'Maths':79,'Hindi':100,'Physics':98})
print(markscard1)
markscard2


Chemistry    89
Maths        95
Physics      95
English      86
Kannada      99
Sankrit      80
dtype: int64


computer_science     99
Chemistry            82
Social_science       84
Maths                79
Hindi               100
Physics              98
dtype: int64

Let's see what happens when we divide one marks card by the other:

In [86]:
print(markscard1/markscard2)
markscard2/markscard1

Chemistry           1.085366
English                  NaN
Hindi                    NaN
Kannada                  NaN
Maths               1.202532
Physics             0.969388
Sankrit                  NaN
Social_science           NaN
computer_science         NaN
dtype: float64


Chemistry           0.921348
English                  NaN
Hindi                    NaN
Kannada                  NaN
Maths               0.831579
Physics             1.031579
Sankrit                  NaN
Social_science           NaN
computer_science         NaN
dtype: float64

The resulting ``Series`` contains the *union* of the indices of the two inputs, which can be computed directly with the ``union`` method of the ``Index`` objects:

In [90]:
markscard2.index.union(markscard1.index)


Index(['Chemistry', 'English', 'Hindi', 'Kannada', 'Maths', 'Physics',
       'Sankrit', 'Social_science', 'computer_science'],
      dtype='object')

In [93]:
abc=pd.Series([90,80,70,60,50],index=['m','n','o','p','q'])
xyz=pd.Series([80,60,50,90,70],index=['p','t','r','m','n'])
print(abc,xyz)
abc*xyz

m    90
n    80
o    70
p    60
q    50
dtype: int64 p    80
t    60
r    50
m    90
n    70
dtype: int64


m    8100.0
n    5600.0
o       NaN
p    4800.0
q       NaN
r       NaN
t       NaN
dtype: float64

If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.
For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:

In [97]:
xyz.add(abc,fill_value=10)

m    180.0
n    150.0
o     80.0
p    140.0
q     60.0
r     60.0
t     70.0
dtype: float64

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
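The same alignment rules and method forms apply to ``DataFrame`` objects, where both the row index and the column labels are aligned. The sketch below uses two small made-up frames (`left` and `right` are names chosen for this illustration) to compare the operator with its method equivalent.

```python
import pandas as pd

left = pd.DataFrame({'m': [1, 2], 'n': [3, 4]})
right = pd.DataFrame({'n': [10, 20, 30], 'o': [5, 6, 7]})

# Operator form: NaN wherever a cell exists in only one of the frames.
plain = left + right
print(plain)

# Method form: a cell missing from one frame is treated as 0;
# a cell missing from both frames (row 2 of column 'm') stays NaN.
filled = left.add(right, fill_value=0)
print(filled)
```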


## Ufuncs: Operations Between DataFrame and Series

When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and a one-dimensional NumPy array.
One common operation is subtracting one row of a two-dimensional array from every row; Pandas, like NumPy, applies such an operation row-wise by default.
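A minimal sketch of this row-wise behavior, and of the column-wise alternative, is shown below. The frame is made-up data; `frame` and its column labels are assumptions for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('mnop'))

# Subtracting a row aligns the Series index with the column labels,
# so the row is subtracted from every row of the frame.
row_wise = frame - frame.iloc[0]
print(row_wise)

# To operate column-wise instead, use the method form and pass axis=0:
# here column 'n' is subtracted from every column.
col_wise = frame.sub(frame['n'], axis=0)
print(col_wise)
```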

# Data wrangling
Getting the data into the shape we want is often the single most time-consuming task in the life of a data scientist. Sometimes it can also be the most frustrating.

## Merge operations
By merging we mean combining different data sets by linking rows with one or more keys. The basic syntax is very simple.

In [131]:
mark1=pd.DataFrame({'Name':['Rahul','Virat','Sandeep','Rohit'],'Age':['29','39','26','19'],'Year':[2018,2020,2022,2024],
                    'Rollnum':[1001,1002,1003,1004],'Percentage':[91,89,79,69]})
mark2=pd.DataFrame({"Subject":["chemistry","maths","Kannada","hindi",],"Age":['29','26','19','39']})
mark1

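|   | Name    | Age | Year | Rollnum | Percentage |
|---|---------|-----|------|---------|------------|
| 0 | Rahul   | 29  | 2018 | 1001    | 91         |
| 1 | Virat   | 39  | 2020 | 1002    | 89         |
| 2 | Sandeep | 26  | 2022 | 1003    | 79         |
| 3 | Rohit   | 19  | 2024 | 1004    | 69         |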


In [132]:
mark2

|   | Subject   | Age |
|---|-----------|-----|
| 0 | chemistry | 29  |
| 1 | maths     | 26  |
| 2 | Kannada   | 19  |
| 3 | hindi     | 39  |


Let's say we want a single dataset that also lists each student's subject. We can create it from `mark1` and `mark2`.

In [133]:
# merge is smart! If the two frames share column names, it uses those columns as the merge keys
mark1.merge(mark2)

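|   | Name    | Age | Year | Rollnum | Percentage | Subject   |
|---|---------|-----|------|---------|------------|-----------|
| 0 | Rahul   | 29  | 2018 | 1001    | 91         | chemistry |
| 1 | Virat   | 39  | 2020 | 1002    | 89         | hindi     |
| 2 | Sandeep | 26  | 2022 | 1003    | 79         | maths     |
| 3 | Rohit   | 19  | 2024 | 1004    | 69         | Kannada   |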


If the key columns have different names, you need to specify them explicitly with `left_on` and `right_on` (here both happen to be `Age`):

In [134]:
mark3=pd.DataFrame({"Subject":["chemistry","maths","Kannada"],"Age":['29','26','19']})
mark3

|   | Subject   | Age |
|---|-----------|-----|
| 0 | chemistry | 29  |
| 1 | maths     | 26  |
| 2 | Kannada   | 19  |


In [135]:
mark1.merge(mark3,right_on="Age",left_on="Age")

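|   | Name    | Age | Year | Rollnum | Percentage | Subject   |
|---|---------|-----|------|---------|------------|-----------|
| 0 | Rahul   | 29  | 2018 | 1001    | 91         | chemistry |
| 1 | Sandeep | 26  | 2022 | 1003    | 79         | maths     |
| 2 | Rohit   | 19  | 2024 | 1004    | 69         | Kannada   |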
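For the case where the key columns really are named differently, a minimal sketch might look like this. The frames are cut-down made-up data, and `StudentAge` is a hypothetical column name chosen only so that the two key names differ.

```python
import pandas as pd

students = pd.DataFrame({'Name': ['Rahul', 'Sandeep'], 'Age': ['29', '26']})
# 'StudentAge' is a hypothetical column name chosen to differ from 'Age'.
subjects = pd.DataFrame({'Subject': ['chemistry', 'maths'], 'StudentAge': ['29', '26']})

merged = students.merge(subjects, left_on='Age', right_on='StudentAge')
print(merged)   # both key columns are kept in the result
```

Because both key columns survive the merge, one of them is usually dropped afterwards, for example with `merged.drop(columns='StudentAge')`.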


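What happened in the merge of `mark1` with `mark3`? Virat is gone! `mark3` has no row with an `Age` of `'39'`, so his record has no match and is dropped from the result.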

By default `merge` does inner joins. If you want a different type of join, you can specify it.

In [137]:
mark4=pd.DataFrame({"Subject":["chemistry","maths","Kannada","computer"],"Age":['29','26','19','20']})
print(mark4)
mark1.merge(mark4,how="outer")

     Subject Age
0  chemistry  29
1      maths  26
2    Kannada  19
3   computer  20


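|   | Name    | Age | Year   | Rollnum | Percentage | Subject   |
|---|---------|-----|--------|---------|------------|-----------|
| 0 | Rahul   | 29  | 2018.0 | 1001.0  | 91.0       | chemistry |
| 1 | Virat   | 39  | 2020.0 | 1002.0  | 89.0       | NaN       |
| 2 | Sandeep | 26  | 2022.0 | 1003.0  | 79.0       | maths     |
| 3 | Rohit   | 19  | 2024.0 | 1004.0  | 69.0       | Kannada   |
| 4 | NaN     | 20  | NaN    | NaN     | NaN        | computer  |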


Check this out:

In [138]:
mark5=pd.DataFrame({"Subject":["chemistry","maths","Kannada","science"],"Age":['29','26','19','20']})
print(mark5)
mark1.merge(mark5,how="outer")

     Subject Age
0  chemistry  29
1      maths  26
2    Kannada  19
3    science  20


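|   | Name    | Age | Year   | Rollnum | Percentage | Subject   |
|---|---------|-----|--------|---------|------------|-----------|
| 0 | Rahul   | 29  | 2018.0 | 1001.0  | 91.0       | chemistry |
| 1 | Virat   | 39  | 2020.0 | 1002.0  | 89.0       | NaN       |
| 2 | Sandeep | 26  | 2022.0 | 1003.0  | 79.0       | maths     |
| 3 | Rohit   | 19  | 2024.0 | 1004.0  | 69.0       | Kannada   |
| 4 | NaN     | 20  | NaN    | NaN     | NaN        | science   |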


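Every `Age` key here is unique on both sides, so this is a one-to-one merge. If a key were repeated in both frames, `merge` would perform a many-to-many join and return every pairwise combination of the matching rows. That behavior is logical when you think about it, but it can still come as a surprise if you do not expect it!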
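A many-to-many merge can be sketched with two small made-up frames in which a key value is repeated on both sides. The names `left`, `right`, `key`, `left_val` and `right_val` are assumptions for this illustration.

```python
import pandas as pd

# The key 'A' appears twice on each side (made-up data).
left = pd.DataFrame({'key': ['A', 'A', 'B'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['A', 'A', 'C'], 'right_val': [10, 20, 30]})

merged = left.merge(right, on='key')
# 2 x 2 = 4 rows for 'A'; 'B' and 'C' have no partner and are dropped by the inner join.
print(merged)
```

If duplicated keys would indicate a data problem, `merge` also accepts a `validate` argument (for example `validate='one_to_one'`) that raises an error instead of silently multiplying rows.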

### Combining data with overlap
Sometimes some data is missing, and it can be "patched" with another dataset. Let's take a look.

In [147]:
group_A=pd.Series([np.nan,20.0,np.nan,42.48,np.nan,84.36,np.nan,47.83],index=['a','b','c','d','e','f','g','h'])
group_B=pd.Series(np.arange(len(group_A)),dtype=np.float32,index=['a','b','c','d','e','f','g','h'])                                                                        

In [149]:
group_A

a      NaN
b    20.00
c      NaN
d    42.48
e      NaN
f    84.36
g      NaN
h    47.83
dtype: float64

In [150]:
group_B

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
g    6.0
h    7.0
dtype: float32

Let's say we want to fill the missing values in `group_A` with the corresponding values from `group_B`. The NumPy-style way to do that is:

In [155]:
# use group_A.index (the labels), not group_A itself (the values), as the new index
pd.Series(np.where(pd.isnull(group_A),group_B,group_A),index=group_A.index)

a     0.00
b    20.00
c     2.00
d    42.48
e     4.00
f    84.36
g     6.00
h    47.83
dtype: float64

That's a bit verbose for something so simple. What about this:

In [152]:
group_A.combine_first(group_B)

a     0.00
b    20.00
c     2.00
d    42.48
e     4.00
f    84.36
g     6.00
h    47.83
dtype: float64
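`combine_first` also handles the case where the two objects do not share exactly the same labels: the result covers the union of both indices. A minimal sketch with made-up data (the names `first`, `second` and `patched` are assumptions for this example):

```python
import numpy as np
import pandas as pd

first = pd.Series([np.nan, 2.0, 3.0], index=['a', 'b', 'c'])
second = pd.Series([10.0, 20.0, 40.0], index=['a', 'b', 'd'])

# Values from `first` win wherever they are present; gaps and labels that
# exist only in `second` are filled from `second`.
patched = first.combine_first(second)
print(patched)
```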