# Pandas: data analysis in Python

**What is pandas?**
 
Pandas can be thought of as NumPy arrays with labels for rows and columns, and better support for heterogeneous data types, but it's also much, much more than that.

It is particularly strong at handling missing data and time series, at reading and writing data in many formats, and at reshaping, grouping, and merging datasets.


**When do you need pandas?**

When working with tabular or structured data (SQL table, Excel spreadsheet, ...):

- Import data
- Clean up messy data
- Explore data, gain insight into data
- Process and prepare your data for analysis
- Analyse your data (together with scikit-learn, statsmodels, ...)

# The pandas data structures: DataFrame and Series

Pandas introduces two new data structures to Python, both built on top of NumPy (which is what makes them fast):

**Series**: a one-dimensional labeled array, akin to a single column (or a single row) of a table

**DataFrame**: a two-dimensional tabular data structure, akin to a database table
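
To make the relationship between the two structures concrete, here is a minimal sketch. The column names and values are made up for illustration and do not come from any real dataset. It shows that both a single column and a single row of a DataFrame come back as a Series.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
df = pd.DataFrame({'city': ['Delhi', 'Chennai'],
                   'value': [1000, 1100]})

column = df['value']   # one column of the table
row = df.iloc[0]       # one row (observation) of the table

print(type(df).__name__)      # DataFrame
print(type(column).__name__)  # Series
print(type(row).__name__)     # Series
```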

# 1. Series

A Series is a one-dimensional object similar to an array, list, or column in a table.

Pandas assigns an index label to each item in the Series. By default, the labels are the integers 0 to N-1, where N is the length of the Series.

**1.1 Creating**

In [1]:
import pandas as pd
import numpy as np

In [2]:
# create a Series with an arbitrary list
s = pd.Series([7, 'satish', 3.14, -1789710578, 'Happy Coding!'])
s

0                7
1           satish
2             3.14
3      -1789710578
4    Happy Coding!
dtype: object
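
The default labels and the resulting dtype can be inspected directly. The following is a small sketch using a shortened version of the list above; it assumes nothing beyond the standard Series attributes `index` and `dtype`.

```python
import pandas as pd

s = pd.Series([7, 'satish', 3.14])

# Default labels run from 0 to N-1
print(list(s.index))   # [0, 1, 2]

# Mixed value types fall back to the generic object dtype
print(s.dtype)         # object
```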

Alternatively, you can specify an index to use when creating the Series.

In [3]:
s = pd.Series([7, 'satish', 3.14, -1789710578, 'Happy Coding!'],
              index=['A', 'B', 'C', 'D', 'E'])
s

A                7
B           satish
C             3.14
D      -1789710578
E    Happy Coding!
dtype: object

The Series constructor can also convert a dictionary, using the keys of the dictionary as its index.

In [4]:
d = {'Delhi': 1000, 'Bombay': 1300, 'Hyderabad': 900, 'Chennai': 1100,
     'Bengaluru': 450, 'Guntur': None}
cities = pd.Series(d)
cities

Bengaluru     450.0
Bombay       1300.0
Chennai      1100.0
Delhi        1000.0
Guntur          NaN
Hyderabad     900.0
dtype: float64
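
A dictionary and an explicit index can also be combined. The sketch below uses a shortened dictionary and a made-up label ('Vizag') to show how this appears to work: the index selects and orders the keys, and a label with no matching key gets NaN. Note that the ordering of a dict-built Series depends on the pandas version; recent releases keep the dictionary's insertion order instead of sorting the keys as in the output above.

```python
import pandas as pd

d = {'Delhi': 1000, 'Bombay': 1300, 'Hyderabad': 900}

# index= picks keys from the dict in the order given;
# 'Vizag' is not a key in d, so its value becomes NaN
subset = pd.Series(d, index=['Hyderabad', 'Delhi', 'Vizag'])
print(subset)
```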

**1.2 Selecting**

You can use the index to select specific items from the Series, either with a single label or with a list of labels.

In [5]:
cities['Hyderabad']

900.0

In [6]:
cities[['Hyderabad', 'Delhi', 'Chennai']]

Hyderabad     900.0
Delhi        1000.0
Chennai      1100.0
dtype: float64
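
Besides plain square brackets, pandas offers the explicit accessors `.loc` (by label) and `.iloc` (by integer position). A minimal sketch, using a small stand-in Series rather than the full `cities` data:

```python
import pandas as pd

# Stand-in data with an explicit, fixed order
sample = pd.Series([1000.0, 1300.0, 1100.0],
                   index=['Delhi', 'Bombay', 'Chennai'])

by_label = sample.loc['Bombay']        # select by index label
by_position = sample.iloc[0]           # select by integer position
span = sample.loc['Delhi':'Bombay']    # label slices include both endpoints

print(by_label)      # 1300.0
print(by_position)   # 1000.0
print(span)
```

Using `.loc` and `.iloc` makes the intent explicit, which helps when the index labels are themselves integers.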

You can also use boolean indexing for selection.

In [7]:
cities[cities < 1000]

Bengaluru    450.0
Hyderabad    900.0
dtype: float64

To make this clearer: cities < 1000 returns a Series of True/False values. Passing that Series back to cities as an indexer returns only the items whose value is True.

In [8]:
print(cities < 1000)
print('\n')
print(cities[cities < 1000])

Bengaluru     True
Bombay       False
Chennai      False
Delhi        False
Guntur       False
Hyderabad     True
dtype: bool


Bengaluru    450.0
Hyderabad    900.0
dtype: float64
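
Several conditions can be combined into one mask. The sketch below rebuilds a small Series with the same kind of values (so it runs on its own) and combines comparisons with `&` and `|`; each comparison needs its own parentheses because of operator precedence.

```python
import numpy as np
import pandas as pd

sample = pd.Series([450.0, 1300.0, 1100.0, np.nan, 900.0],
                   index=['Bengaluru', 'Bombay', 'Chennai', 'Guntur', 'Hyderabad'])

# & means "and", | means "or"
middle = sample[(sample > 500) & (sample < 1200)]
outside = sample[(sample <= 500) | (sample >= 1200)]

# Comparisons against NaN are False, so Guntur matches neither mask
print(middle)
print(outside)
```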


**1.3 Editing**

In [9]:
# changing based on the index
print('Old value:', cities['Hyderabad'])
cities['Hyderabad'] = 1400
print('New value:', cities['Hyderabad'])

Old value: 900.0
New value: 1400.0


In [10]:
# changing values using boolean logic
print(cities[cities < 1000])
print('\n')

cities[cities < 1000] = 750

print(cities[cities < 1000])

Bengaluru    450.0
dtype: float64


Bengaluru    750.0
dtype: float64
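
The assignments above modify the Series in place. When the original values should be kept, one option is `Series.mask`, which returns a new Series with the values replaced wherever the condition is True. A small sketch with stand-in data:

```python
import pandas as pd

prices = pd.Series([450.0, 1300.0, 1400.0],
                   index=['Bengaluru', 'Bombay', 'Hyderabad'])

# Replace values below 1000 with 750 in a new Series
adjusted = prices.mask(prices < 1000, 750)

print(adjusted['Bengaluru'])   # 750.0
print(prices['Bengaluru'])     # 450.0, the original is unchanged
```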


**1.4 Mathematical Operations**

Mathematical operations can be done using scalars and functions.

In [11]:
# divide city values by 4
cities / 4

Bengaluru    187.5
Bombay       325.0
Chennai      275.0
Delhi        250.0
Guntur         NaN
Hyderabad    350.0
dtype: float64

In [12]:
# square city values
np.square(cities)

Bengaluru     562500.0
Bombay       1690000.0
Chennai      1210000.0
Delhi        1000000.0
Guntur             NaN
Hyderabad    1960000.0
dtype: float64
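
Element-wise operations such as the ones above keep NaN as NaN, while reductions like `sum` skip missing values by default. The following sketch uses a three-item stand-in Series to illustrate the difference.

```python
import numpy as np
import pandas as pd

values = pd.Series([750.0, 1300.0, np.nan],
                   index=['Bengaluru', 'Bombay', 'Guntur'])

halved = values / 2                  # element-wise: index kept, NaN stays NaN
total = values.sum()                 # reduction: NaN is skipped by default
strict = values.sum(skipna=False)    # NaN, because the missing value is included

print(halved)
print(total)    # 2050.0
print(strict)   # nan
```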

You can add two Series together. The result is indexed by the union of the two indexes, and the addition is performed only for labels present in both Series. Any label that appears in just one of them yields NaN (not a number).

In [13]:
print(cities[['Bengaluru', 'Bombay', 'Chennai']])
print('\n')
print(cities[['Bombay', 'Hyderabad']])
print('\n')
print(cities[['Bengaluru', 'Bombay', 'Chennai']] + cities[['Bombay', 'Hyderabad']])

Bengaluru     750.0
Bombay       1300.0
Chennai      1100.0
dtype: float64


Bombay       1300.0
Hyderabad    1400.0
dtype: float64


Bengaluru       NaN
Bombay       2600.0
Chennai         NaN
Hyderabad       NaN
dtype: float64
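
If NaN for unmatched labels is not what you want, the `add` method accepts a `fill_value` that stands in for the missing side. A minimal sketch, rebuilding the two slices as standalone Series so it runs on its own:

```python
import pandas as pd

a = pd.Series([750.0, 1300.0, 1100.0],
              index=['Bengaluru', 'Bombay', 'Chennai'])
b = pd.Series([1300.0, 1400.0],
              index=['Bombay', 'Hyderabad'])

plain = a + b                     # NaN wherever a label is missing on one side
filled = a.add(b, fill_value=0)   # treat the missing side as 0 instead

print(plain)
print(filled)
```

With `fill_value=0`, only Bombay is a true sum; the other three labels carry over their single available value.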


**1.5 Missing Values**

You can check whether a label is present in the index of a Series with the in operator.

In [14]:
print('Vizag' in cities)
print('Hyderabad' in cities)

False
True
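
Note that `in` tests the index labels, not the values. The sketch below, on a two-item stand-in Series, contrasts the two and shows a couple of ways to test for a value instead.

```python
import pandas as pd

sample = pd.Series([1000.0, 900.0], index=['Delhi', 'Hyderabad'])

label_found = 'Delhi' in sample             # True: 'Delhi' is an index label
value_as_label = 1000.0 in sample           # False: 1000.0 is a value, not a label
value_found = 1000.0 in sample.values       # True: search the underlying values
value_found_isin = bool(sample.isin([1000.0]).any())   # True: same check via isin

print(label_found, value_as_label, value_found, value_found_isin)
```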


Missing (NULL/NaN) values can be detected with isnull and notnull.

In [15]:
# returns a boolean series indicating which values aren't NULL
cities.notnull()

Bengaluru     True
Bombay        True
Chennai       True
Delhi         True
Guntur       False
Hyderabad     True
dtype: bool

In [16]:
# use boolean logic to find the NULL cities
print(cities.isnull())
print('\n')
print(cities[cities.isnull()])

Bengaluru    False
Bombay       False
Chennai      False
Delhi        False
Guntur        True
Hyderabad    False
dtype: bool


Guntur   NaN
dtype: float64
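
Once missing values have been located, they are usually either dropped or replaced. A minimal sketch with stand-in data, using `dropna` and `fillna`; the replacement value 0 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

sample = pd.Series([1000.0, np.nan, 900.0],
                   index=['Delhi', 'Guntur', 'Hyderabad'])

dropped = sample.dropna()    # new Series without the missing entry
filled = sample.fillna(0)    # new Series with NaN replaced by 0

print(dropped)
print(filled)
```

Both methods return a new Series and leave `sample` itself unchanged.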
