# Python Data Analysis Library

---

**What is Pandas or Python Data Analysis Library?**


pandas is an open-source Python data analysis library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

The name pandas is derived from "panel data", an econometrics term for multidimensional data sets.

---

**Library Highlights**

*   A fast and efficient **DataFrame** object for data manipulation with integrated indexing;
*   Tools for **reading and writing** data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
*   Intelligent **data alignment** and integrated handling of **missing data**: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
*   Flexible **reshaping** and pivoting of data sets;
*   Intelligent label-based **slicing, fancy indexing,** and **subsetting** of large data sets;
*   Columns can be inserted and deleted from data structures for **size mutability**;
*   Aggregating or transforming data with a powerful **group by** engine allowing split-apply-combine operations on data sets;
*   High performance **merging and joining** of data sets;
*   **Hierarchical axis indexing** provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
*   **Time series**-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
*   Highly **optimized for performance**, with critical code paths written in Cython or C.
*   Python with pandas is in use in a wide variety of **academic and commercial** domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

---

**Pandas is well suited to many different kinds of data:**
*   Tabular data with heterogeneously typed columns.
*   Ordered and unordered time series data.
*   Arbitrary matrix data with row and column labels.
*   Any other form of observational or statistical data set. The data need not be labeled at all to be placed into a pandas data structure.

---




**Data Structures in Pandas:**

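Pandas provides three data structures: **Series, Data Frame and Panel**, all of which are built on top of the NumPy array. Panel has been removed from recent pandas releases, so current code works with Series and Data Frame only.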


---



# Series

---
**Series** is a one-dimensional labeled array structure that stores homogeneous data, i.e., data of a single type.

The values of a Series are **value-mutable**, while the Series itself is **size-immutable**.

---

In other words, **Series** is a **one-dimensional labeled array** capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

**s = pd.Series(data, index=index)**

---




Here, data can be many different things:

*   a Python dict
*   an ndarray
*   a scalar value (like 5)

---

The passed index is a list of axis labels. Thus, this separates into a few cases depending on **what data is:**

---



**Create a Series from ndarray**

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created with the values [0, ..., len(data) - 1].

---

Example1: 

In [0]:
import pandas as pd
import numpy as np

arr = np.array([10,20,30,40,50])

s = pd.Series(arr)

print(s)

0    10
1    20
2    30
3    40
4    50
dtype: int64


Example2:

---



In [0]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    1.830376
b    0.456866
c    1.160127
d   -0.008266
e    0.754565
dtype: float64


In [0]:
print(s.index)

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


**Creating a Series from a Python dictionary (dict):**

---

Example1:

In [0]:
import pandas as pd

data = {'a':10, 'b':20, 'c':30}
s = pd.Series(data)
print(s)

a    10
b    20
c    30
dtype: int64


In [0]:
import pandas
data = {'a': 0., 'b': 1., 'c': 2.}  # named `data` to avoid shadowing the built-in dict
print(data)

s = pandas.Series(data, index=['b', 'c', 'd', 'a'])
print(s)

{'a': 0.0, 'b': 1.0, 'c': 2.0}
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


**Note:** NaN (not a number) is the standard missing data marker used in pandas.
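Missing entries such as the `d` label above can be detected and handled explicitly. The following is a minimal sketch using the standard `isna()`, `fillna()` and `dropna()` methods; the variable names are chosen for illustration only.

```python
import pandas as pd

# Build a Series from a dict, requesting a label ('d') that the dict lacks.
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

# isna() marks the missing entries with True.
missing = s.isna()
print(missing)

# fillna() returns a copy with the gaps replaced by the given value.
filled = s.fillna(0.0)
print(filled)

# dropna() returns a copy without the missing entries.
dropped = s.dropna()
print(dropped)
```

None of these calls modifies `s` itself; each returns a new Series.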

**Creating Series from scalar value**

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

---

Example1:

In [0]:
s= pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64


**Accessing Data from a Series:**

---

**Slicing:** Retrieving a part of the series using slicing.

Series acts very similarly to a ndarray, and is a valid argument to most NumPy functions. However, operations such as slicing will also **slice the index**.

--- 

Example:


In [0]:
import pandas as pd
import numpy as np

arr = np.array([10, 20, 30, 40, 50])
s = pd.Series(arr)

print(s)

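# slicing also slices the index

print(s[1])          # single element by label
print(s[0:5])        # slice of the first five elements
print(s[::-1])       # reverse the series
print(s[[0, 2, 4]])  # select several elements with a list of labels

0    10
1    20
2    30
3    40
4    50
dtype: int64
20
0    10
1    20
2    30
3    40
4    50
dtype: int64
4    50
3    40
2    30
1    20
0    10
dtype: int64
0    10
2    30
4    50
dtype: int64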


A Series is **dict-like**


---


A Series is like a fixed-size dict in that you can get and set values by index label:

In [0]:
import pandas as pd

data = {'a':10, 'b':20, 'c':30}
s = pd.Series(data)
print(s)

print(s['a'])  # access a value by its index label


a    10
b    20
c    30
dtype: int64
10


**Vectorized operations and label alignment with Series**

In [0]:
print(s + s)  # element-wise addition of two Series

print(s * 2)  # multiplication of each element by a scalar

print(np.exp(s))  # element-wise exponential via NumPy



a    20
b    40
c    60
dtype: int64
a    20
b    40
c    60
dtype: int64
a    2.202647e+04
b    4.851652e+08
c    1.068647e+13
dtype: float64
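The operations above combine a Series with itself, so every label matches. To illustrate label alignment, here is a small sketch (with illustrative variable names) in which the two operands share only some labels:

```python
import pandas as pd

s = pd.Series({'a': 10, 'b': 20, 'c': 30})

# Two overlapping pieces of the same Series.
left = s.iloc[1:]    # labels 'b' and 'c'
right = s.iloc[:-1]  # labels 'a' and 'b'

# Addition aligns on labels, not on positions.
# Labels present in only one operand produce NaN in the result.
total = left + right
print(total)
```

The result carries the union of the labels, and only `b`, which appears in both operands, has a numeric value.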


**Name attribute**

Series can also have a name attribute:


---

Example:


In [0]:
s = pd.Series(np.random.randn(5), name='something')
print(s)

0    2.967709
1    0.073752
2   -0.639722
3    0.027365
4   -0.301899
Name: something, dtype: float64


You can rename a Series with the pandas.Series.rename() method.

In [0]:
s2 = s.rename("different")
print(s2)

0    2.967709
1    0.073752
2   -0.639722
3    0.027365
4   -0.301899
Name: different, dtype: float64


# Data Frame

**A Data Frame** is a **2D data structure** in which data is aligned in a tabular fashion, consisting of **rows and columns**.

A Data Frame can be created using the following constructor:

df = pandas.DataFrame(data, index, columns, dtype, copy)

---

DataFrame accepts many different kinds of input:

*   Dict of 1D ndarrays, lists, dicts, or Series
*   2-D numpy.ndarray
*   Structured or record ndarray
*   A Series
*   Another DataFrame

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
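As a minimal sketch of passing both optional arguments, the following builds a Data Frame from a 2-D ndarray; the labels `r1`..`r3` and `x`, `y` are made up for the example.

```python
import numpy as np
import pandas as pd

# A 3 x 2 array holding the values 0..5.
values = np.arange(6).reshape(3, 2)

# index supplies the row labels, columns supplies the column labels.
df = pd.DataFrame(values, index=['r1', 'r2', 'r3'], columns=['x', 'y'])
print(df)

# The labels are available afterwards through the index and columns attributes.
print(df.index.tolist())
print(df.columns.tolist())
```

If either argument is omitted, pandas falls back to integer labels starting at 0 for that axis.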

---



**Creating a Data Frame using List**

---

Converting a **list** into a **Data Frame**

In [0]:
import pandas

listx = [10, 20, 30, 40, 50]

table = pandas.DataFrame(listx)

print (table)

    0
0  10
1  20
2  30
3  40
4  50


**Creating a Data Frame from a list of dictionaries**

---



In [0]:
import pandas
data_list = [{'a':10, 'b':20},{'a':20, 'b':30,'c':40}]
table = pandas.DataFrame(data_list, index = ['first','second'])
print(table)

         a   b     c
first   10  20   NaN
second  20  30  40.0


**Converting a dictionary of series into a Data Frame:**

In [0]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pandas.DataFrame(d)
print(df)
print("---------------------------------------")

# Select and reorder rows by passing index

df2 = pd.DataFrame(d, index=['d', 'b', 'a'])

print(df2)
print("---------------------------------------")
# Select rows and columns; a column label missing from the dict ('three') is filled with NaN

df = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(df)

#Note: The row and column labels can be accessed respectively by accessing the index and columns attributes.

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
---------------------------------------
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
---------------------------------------
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN


**Note:** When a particular set of columns is passed along with a dict of data, the passed columns override the keys in the dict.

---



**Creating Data Frame From a list of dicts:**

In [0]:
import pandas as pd

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
table = pd.DataFrame(data2)
print(table)

   a   b     c
0  1   2   NaN
1  5  10  20.0


**Column selection, addition, deletion**


---

Add New Column in existing Data Frame

In [0]:
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
table = pd.DataFrame(d)

#Adding new column

table["three"] = pd.Series([1, 2, 3], index = ['a', 'b', 'c'])

print(table)

print("---------------------------------------------")
# Adding a new column with boolean values

table['flag'] = table["three"]>2

print(table)

   one  two  three
a  1.0  1.0    1.0
b  2.0  2.0    2.0
c  3.0  3.0    3.0
d  NaN  4.0    NaN
---------------------------------------------
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    2.0  False
c  3.0  3.0    3.0   True
d  NaN  4.0    NaN  False


**Columns can be deleted or popped like with a dict:**

---

A Data Frame column can be deleted using the **del statement:**

In [0]:
del table['two']
print(table)

   one  three   flag
a  1.0    1.0  False
b  2.0    2.0  False
c  3.0    3.0   True
d  NaN    NaN  False


Data Frame Column can be deleted using the **pop() function**:

---

**pop()** removes the named column from the Data Frame and returns it as a Series, much like dict.pop() removes a key and returns its value.

In [0]:
three = table.pop('three')

print(three)

print(table)

a    1.0
b    2.0
c    3.0
d    NaN
Name: three, dtype: float64
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False


When inserting **a scalar value**, it will naturally be propagated to fill the column:

---



In [0]:
table["foo"] = "bar"
print(table)

   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar


**Indexing / Selection**

---

**The basics of indexing are as follows:**


---

| Operation | Syntax | Result |
|---|---|---|
| Select column | **df[col]** | Series |
| Select row by label | **df.loc[label]** | Series |
| Select row by integer location | **df.iloc[loc]** | Series |
| Slice rows | **df[5:10]** | DataFrame |
| Select rows by boolean vector | **df[bool_vec]** | DataFrame |

---

Example : Row selection, for example, returns a Series whose index is the columns of the DataFrame:

In [0]:
table.loc['a']

one         1
flag    False
foo       bar
Name: a, dtype: object

In [0]:
table.iloc[2]

one        3
flag    True
foo      bar
Name: c, dtype: object
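The remaining indexing operations can be sketched on a small stand-alone frame. The data below is invented for the example and is not the `table` object built earlier.

```python
import pandas as pd

table = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0],
                      'flag': [False, False, True, True]},
                     index=['a', 'b', 'c', 'd'])

col = table['one']              # select column -> Series
row_by_label = table.loc['b']   # select row by label -> Series
row_by_pos = table.iloc[0]      # select row by integer location -> Series
first_two = table[0:2]          # slice rows -> DataFrame
flagged = table[table['flag']]  # select rows by boolean vector -> DataFrame

print(first_two)
print(flagged)
```

Note that the row slice is positional and excludes its end point, while the boolean selection keeps exactly the rows where the vector is True.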

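**Data Frame - Row Addition:**

---

The **append() function** can be used to add one or more rows to the Data Frame. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions, use pd.concat() instead.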


In [0]:
print(table)
print("----------------------------------------")
row = pd.DataFrame([[1,'True'],[3,'False']], columns = ['one','flag'])
table1= table.append(row)
print(table1)

   one   flag  foo
a  1.0  False  bar
b  2.0  False  bar
c  3.0   True  bar
d  NaN  False  bar
----------------------------------------
    flag  foo  one
a  False  bar  1.0
b  False  bar  2.0
c   True  bar  3.0
d  False  bar  NaN
0   True  NaN  1.0
1  False  NaN  3.0
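On pandas 2.0 and later, where `DataFrame.append()` is no longer available, the same row addition can be written with `pd.concat()`. This is a minimal sketch with made-up data rather than a reproduction of the cell above:

```python
import pandas as pd

table = pd.DataFrame({'one': [1.0, 2.0], 'flag': [False, True]},
                     index=['a', 'b'])
new_rows = pd.DataFrame({'one': [3.0], 'flag': [True]}, index=['c'])

# pd.concat stacks the frames vertically and aligns them on column labels.
combined = pd.concat([table, new_rows])
print(combined)

# ignore_index=True discards the existing row labels and renumbers from 0.
renumbered = pd.concat([table, new_rows], ignore_index=True)
print(renumbered)
```

Like `append()`, `pd.concat()` returns a new Data Frame and leaves its inputs unchanged.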


**Data Frame - Row Deletion:**

---

The drop() function can be used to drop rows whose labels are provided


In [0]:
table1 = table1.drop('d')
print(table1)

    flag  foo  one
a  False  bar  1.0
b  False  bar  2.0
c   True  bar  3.0
0   True  NaN  1.0
1  False  NaN  3.0


**Data alignment and arithmetic:**

---

DataFrame objects automatically align on both the columns and the index (row labels) during arithmetic. Again, the resulting object will have the union of the column and row labels.

In [0]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])

df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])

print(df + df2)

          A         B         C   D
0 -2.565289  1.787026  0.948796 NaN
1  1.137447  2.679988 -0.086896 NaN
2  1.023824  5.008005  0.761682 NaN
3  1.471461  2.219874  1.705523 NaN
4  1.600869 -0.535285 -0.444800 NaN
5 -1.463714  1.335283 -0.062675 NaN
6 -2.608568 -0.023066  0.294771 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN


When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example:

In [0]:
print(df - df.iloc[0])

          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1  1.820425  1.693241 -1.797476  0.589538
2  0.372968  1.620883  0.026293  0.126685
3  1.096899  0.107548 -0.937305  1.931768
4  1.661470 -1.432305 -0.965407  0.281767
5 -1.263498  1.505922  0.113657 -0.232598
6 -0.539878 -0.337395 -0.174870  2.082512
7  0.665459  0.847926 -0.706562 -0.974827
8  0.158647  1.019152 -0.935310  3.216164
9  2.017132  0.900415 -1.090686  2.248355


**Transposing**

---

To transpose, access the T attribute (also the transpose function), similar to an ndarray:

In [0]:
print(df[:5].T)

          0         1         2         3         4
A -0.414549  1.405876 -0.041582  0.682350  1.246921
B  0.469634  2.162874  2.090517  0.577182 -0.962671
C  0.713379 -1.084097  0.739672 -0.223926 -0.252028
D -0.985655 -0.396116 -0.858970  0.946113 -0.703888


**Loading CSV data into Data Frame:**

---

Data stored in CSV format can be loaded into a DataFrame using the **read_csv() function**.

In [0]:
df = pandas.read_csv(path_to_file)

**Storing Data into CSV File:**

---

Data in a DataFrame can be written to a CSV file using the ***to_csv() function***.


In [0]:
table.to_csv(path_to_file) # if the file does not exist, it is created automatically
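The two CSV calls can be tried without touching the file system by writing to an in-memory text buffer. This is a sketch under that assumption; the frame contents are invented, and `index_col=0` is used so that the first CSV column is read back as the row index.

```python
import io
import pandas as pd

table = pd.DataFrame({'one': [1.0, 2.0], 'foo': ['bar', 'baz']},
                     index=['a', 'b'])

# to_csv accepts any writable text buffer in place of a file path.
buffer = io.StringIO()
table.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# read_csv accepts a readable buffer as well.
# index_col=0 turns the first column back into the row labels.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)
```

Without `index_col=0`, the row labels would come back as an ordinary, unnamed first column.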

**Loading Excel sheet data into a pandas Data Frame**

---

Data stored in an Excel sheet can be loaded into a DataFrame using the **read_excel() function**.

In [0]:
sheet = pandas.read_excel(path_to_file)

**Storing Data into Excel File:**

---

Data in a DataFrame can be written to an Excel file using the ***to_excel() function***.

In [0]:
table.to_excel(path_to_file) # if the file does not exist, it is created automatically

# Panel



---

Panel is a somewhat less-used container for 3-dimensional data. It was deprecated in pandas 0.20 and removed in pandas 0.25, so the example in this section requires an older release. The term panel data is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. The names for the 3 axes are intended to give some semantic meaning to describing operations involving panel data and, in particular, econometric analysis of panel data. However, for the strict purposes of slicing and dicing a collection of DataFrame objects, you may find the axis names slightly arbitrary:

*   items: axis 0, each item corresponds to a DataFrame contained inside
*   major_axis: axis 1, it is the index (rows) of each of the DataFrames
*   minor_axis: axis 2, it is the columns of each of the DataFrames

Construction of Panels works about like you would expect:

In [0]:
wp = pd.Panel(np.random.randn(2, 5, 4), items=['Item1', 'Item2'],
              major_axis=pd.date_range('1/1/2000', periods=5),
              minor_axis=['A', 'B', 'C', 'D'])
print(wp)

print("total dimension:", wp.ndim)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 2000-01-01 00:00:00 to 2000-01-05 00:00:00
Minor_axis axis: A to D
total dimension: 3
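Because Panel is not available in current pandas releases, the cell above only runs on old versions. One possible substitute, sketched here as an assumption rather than a drop-in replacement, is a DataFrame whose rows carry a two-level MultiIndex combining the former items axis and major axis.

```python
import numpy as np
import pandas as pd

items = ['Item1', 'Item2']
dates = pd.date_range('1/1/2000', periods=5)
columns = ['A', 'B', 'C', 'D']

# The row index pairs every item with every date (2 x 5 = 10 rows).
index = pd.MultiIndex.from_product([items, dates], names=['item', 'date'])
data = np.random.randn(len(items) * len(dates), len(columns))
wp = pd.DataFrame(data, index=index, columns=columns)
print(wp)

# Selecting one item returns an ordinary DataFrame indexed by date.
item1 = wp.loc['Item1']
print(item1.shape)
```

The level names `item` and `date` are arbitrary choices for this sketch.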


# Essential Basic Functionality of Series

---

**head()**

---

head() returns the first n rows of the data structure (by default, the first 5 rows).

In [0]:
long_series = pd.Series(np.random.randn(1000))
print(long_series.head())

0    0.404458
1   -0.820980
2   -2.103035
3   -0.574866
4   -0.137923
dtype: float64


**tail()**

---

tail() returns the last n rows of the data structure (by default, the last 5 rows).

In [0]:
print(long_series.tail(3))

997    1.007061
998   -0.926583
999   -1.383625
dtype: float64


**ndim:**

---

The ndim attribute returns the number of dimensions of the data structure.


In [0]:
import pandas as pd
import numpy as np

df = pd.Series(np.arange(1,51))

print(df.ndim)

1


**axes:**

---

The axes attribute returns a list containing the row axis labels (the index).

In [0]:
import pandas as pd
import numpy as np

df = pd.Series(np.arange(1,51))
print(df.axes)

[RangeIndex(start=0, stop=50, step=1)]


# Basic Functionality of DataFrame

---

**sum()**

---

sum() returns the sum of the values along the requested axis.

In [0]:
import pandas as pd
import numpy as np

d = {'odd':np.arange(1,100,2),
     'even':np.arange(0,100,2)}

print(d['odd'])
print(d['even'])

df = pd.DataFrame(d)

print(df.sum())

[ 1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
 97 99]
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94
 96 98]
even    2450
odd     2500
dtype: int64
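The call above uses the default axis and therefore adds up each column. As a small sketch of what "requested axis" means, the following uses a shortened version of the same data and sums along both axes:

```python
import numpy as np
import pandas as pd

# Five odd and five even numbers instead of fifty, to keep the output short.
df = pd.DataFrame({'odd': np.arange(1, 10, 2),
                   'even': np.arange(0, 10, 2)})

# axis=0 (the default): one total per column.
col_totals = df.sum()
print(col_totals)

# axis=1: one total per row.
row_totals = df.sum(axis=1)
print(row_totals)
```

The same `axis` argument applies to the other reductions in this section, such as `std()`.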


**std():**

---

std() returns the standard deviation of the values for the requested axis


In [0]:
import pandas as pd
import numpy as np

d = {'odd':np.arange(1,100,2),
     'even':np.arange(0,100,2)}

print(d['odd'])
print(d['even'])

df = pd.DataFrame(d)

print(df.std())

[ 1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
 97 99]
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94
 96 98]
even    29.154759
odd     29.154759
dtype: float64


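**describe():**

---

describe() returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column.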




In [0]:
import pandas as pd
import numpy as np

d = {'odd':np.arange(1,100,2),
     'even':np.arange(0,100,2)}

# print(d['odd'])
# print(d['even'])

df = pd.DataFrame(d)

print(df.describe())

            even        odd
count  50.000000  50.000000
mean   49.000000  50.000000
std    29.154759  29.154759
min     0.000000   1.000000
25%    24.500000  25.500000
50%    49.000000  50.000000
75%    73.500000  74.500000
max    98.000000  99.000000


**Iterating a DataFrame: iteritems()**

---

iteritems() iterates over the columns as (column name, Series) pairs. It was removed in pandas 2.0; items() is the equivalent method on current versions.

In [0]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5,4),
                 columns = ['col1', 'col2', 'col3', 'col4'])

# print(df)

for key, value in df.items():  # items() replaces iteritems(), which was removed in pandas 2.0
  print(key, value)

col1 0    0.347366
1    0.103036
2    0.407164
3    0.892615
4    0.814948
Name: col1, dtype: float64
col2 0    0.362091
1    0.116015
2    0.130096
3    0.406870
4    0.669844
Name: col2, dtype: float64
col3 0    0.416794
1    0.345382
2    0.646535
3    0.794729
4    0.511720
Name: col3, dtype: float64
col4 0    0.461166
1    0.039785
2    0.299809
3    0.305677
4    0.626778
Name: col4, dtype: float64


**Iterating a DataFrame - iterrows()**


---
iterrows() iterates over the rows as (index, Series) pairs.

In [0]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5,4),
                 columns = ['col1', 'col2', 'col3', 'col4'])

# print(df)

for key, value in df.iterrows():
  print(key, value)

0 col1    0.997227
col2    0.715792
col3    0.755535
col4    0.922561
Name: 0, dtype: float64
1 col1    0.966294
col2    0.464187
col3    0.937452
col4    0.501882
Name: 1, dtype: float64
2 col1    0.523291
col2    0.478921
col3    0.511940
col4    0.755952
Name: 2, dtype: float64
3 col1    0.577980
col2    0.541918
col3    0.115140
col4    0.737888
Name: 3, dtype: float64
4 col1    0.989529
col2    0.909071
col3    0.911384
col4    0.933799
Name: 4, dtype: float64


**Iterating a DataFrame: itertuples()**

---

itertuples() returns an iterator yielding a named tuple for each row.

In [0]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(5,4),
                 columns = ['col1', 'col2', 'col3', 'col4'])

# print(df)

for row in df.itertuples():
  print(row)

Pandas(Index=0, col1=0.3271088319508374, col2=0.8798406326592687, col3=0.5758058157330783, col4=0.4474182686536091)
Pandas(Index=1, col1=0.8866541603650968, col2=0.8486574167062793, col3=0.959083054333319, col4=0.10082891223946777)
Pandas(Index=2, col1=0.8811831767796888, col2=0.28974734093656584, col3=0.5021811277574298, col4=0.6033728519996625)
Pandas(Index=3, col1=0.46900999295038115, col2=0.33052113449281495, col3=0.4328992041414724, col4=0.22658950632432695)
Pandas(Index=4, col1=0.8749013530458443, col2=0.8659441010111678, col3=0.8855544740631585, col4=0.8095008967770386)
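Since each row is a named tuple, its fields can be read as attributes. The sketch below, using made-up data, computes a per-row total that way and compares it with the equivalent column-wise expression:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# index=False leaves the row label out of each tuple.
# Fields are then accessed by column name.
loop_totals = [row.col1 + row.col2 for row in df.itertuples(index=False)]
print(loop_totals)

# The same result expressed as a vectorized column operation.
vector_totals = (df['col1'] + df['col2']).tolist()
print(vector_totals)
```

Where a vectorized form exists it is usually preferable to an explicit loop; iteration is mainly useful when each row needs custom Python logic.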


# Groupby Operations

---

By “group by” we are referring to a process involving one or more of the following steps:

 
*   **Splitting** the data into groups based on some criteria.
*   **Applying** a function to each group independently.
*   **Combining** the results into a data structure.

Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:

**Aggregation:** compute a summary statistic (or statistics) for each group. Some examples:

*   Compute group sums or means.
*   Compute group sizes / counts.

**Transformation:** perform some group-specific computations and return a like-indexed object. Some examples:

*   Standardize data (zscore) within a group.
*   Filling NAs within groups with a value derived from each group.

**Filtration:** discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

*   Discard data that belongs to groups with only a few members.
*   Filter out data based on the group sum or mean.
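The three kinds of apply step can be sketched on a small frame. The team names and point values below are invented for illustration, and the group-size threshold is an arbitrary choice.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['India', 'India', 'Australia', 'Australia', 'Pakistan'],
                   'Points': [10, 20, 30, 50, 5]})
grouped = df.groupby('Team')['Points']

# Aggregation: one summary value per group.
means = grouped.mean()
print(means)

# Transformation: the result has the same index as the input,
# here each value's deviation from its own group mean.
centered = df['Points'] - grouped.transform('mean')
print(centered)

# Filtration: keep only the rows of groups with more than one member.
multi = df.groupby('Team').filter(lambda g: len(g) > 1)
print(multi)
```

Aggregation shrinks the data to one row per group, transformation keeps the original shape, and filtration returns a subset of the original rows.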

---



**Splitting an object into groups:**

A pandas object can be split into multiple groups.

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:

In [0]:
lst = [1, 2, 3, 1, 2, 3]

s = pd.Series([1, 2, 3, 10, 20, 30], lst)

print(s)

grouped = s.groupby(level=0)

print(grouped.first())
print(grouped.last())
print(grouped.sum())

1     1
2     2
3     3
1    10
2    20
3    30
dtype: int64
1    1
2    2
3    3
dtype: int64
1    10
2    20
3    30
dtype: int64
1    11
2    22
3    33
dtype: int64


**Groupby Operation on DataFrame:**

---



In [0]:
import pandas as pd

world_cup = {'Team':['West Indies','West Indies','India', 'Australia', 'Pakistan', 'Sri Lanka', 'Australia','Australia','Australia', 'India', 'Australia'],
             'Rank':[7,7,2,1,6,4,1,1,1,2,1],
             'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}

df = pd.DataFrame(world_cup)
print(df)


    Rank         Team  Year
0      7  West Indies  1975
1      7  West Indies  1979
2      2        India  1983
3      1    Australia  1987
4      6     Pakistan  1992
5      4    Sri Lanka  1996
6      1    Australia  1999
7      1    Australia  2003
8      1    Australia  2007
9      2        India  2011
10     1    Australia  2015


**The DataFrame is grouped according to the 'Team' column**

In [0]:
import pandas as pd

world_cup = {'Team':['West Indies','West Indies','India', 'Australia', 'Pakistan', 'Sri Lanka', 'Australia','Australia','Australia', 'India', 'Australia'],
             'Rank':[7,7,2,1,6,4,1,1,1,2,1],
             'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}

df = pd.DataFrame(world_cup)
print(df.groupby('Team').groups)

{'Australia': Int64Index([3, 6, 7, 8, 10], dtype='int64'), 'India': Int64Index([2, 9], dtype='int64'), 'Pakistan': Int64Index([4], dtype='int64'), 'Sri Lanka': Int64Index([5], dtype='int64'), 'West Indies': Int64Index([0, 1], dtype='int64')}


**Grouping by multiple columns:**

The DataFrame is grouped according to the "Team" and "Rank" columns.

In [0]:
import pandas as pd

world_cup = {'Team':['West Indies','West Indies','India', 'Australia', 'Pakistan', 'Sri Lanka', 'Australia','Australia','Australia', 'India', 'Australia'],
             'Rank':[7,7,2,1,6,4,1,1,1,2,1],
             'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}

df = pd.DataFrame(world_cup)
print(df.groupby(['Team','Rank']).groups)

{('Australia', 1): Int64Index([3, 6, 7, 8, 10], dtype='int64'), ('India', 2): Int64Index([2, 9], dtype='int64'), ('Pakistan', 6): Int64Index([4], dtype='int64'), ('Sri Lanka', 4): Int64Index([5], dtype='int64'), ('West Indies', 7): Int64Index([0, 1], dtype='int64')}


**Iterating Through Groups:**

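Groups can be iterated over with a for loop, much like the output of itertools.groupby(); each iteration yields the group name and the corresponding subset of the data.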

In [0]:
import pandas as pd

world_cup = {'Team':['West Indies','West Indies','India', 'Australia', 'Pakistan', 'Sri Lanka', 'Australia','Australia','Australia', 'India', 'Australia'],
             'Rank':[7,7,2,1,6,4,1,1,1,2,1],
             'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}

df = pd.DataFrame(world_cup)
grouped = df.groupby('Team')

for name, group in grouped:
  print(name)

Australia
India
Pakistan
Sri Lanka
West Indies


**Selecting a Group:**

A single group can be selected using **get_group()**

In [0]:
import pandas as pd

world_cup = {'Team':['West Indies','West Indies','India', 'Australia', 'Pakistan', 'Sri Lanka', 'Australia','Australia','Australia', 'India', 'Australia'],
             'Rank':[7,7,2,1,6,4,1,1,1,2,1],
             'Year':[1975,1979,1983,1987,1992,1996,1999,2003,2007,2011,2015]}

df = pd.DataFrame(world_cup)
grouped = df.groupby('Team')

print(grouped.get_group('India'))

   Rank   Team  Year
2     2  India  1983
9     2  India  2011


**Aggregation:**

---

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.

An obvious one is aggregation via the aggregate() or equivalently agg() method:




In [0]:
import pandas as pd
import numpy as np

d = {'odd':np.arange(1,100,2), 'even':np.arange(0,100,2)}
print(d['odd'])
print(d['even'])

df = pd.DataFrame(d)
print(df.groupby('odd').groups)

[ 1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
 97 99]
[ 0  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94
 96 98]
{1: Int64Index([0], dtype='int64'), 3: Int64Index([1], dtype='int64'), 5: Int64Index([2], dtype='int64'), 7: Int64Index([3], dtype='int64'), 9: Int64Index([4], dtype='int64'), 11: Int64Index([5], dtype='int64'), 13: Int64Index([6], dtype='int64'), 15: Int64Index([7], dtype='int64'), 17: Int64Index([8], dtype='int64'), 19: Int64Index([9], dtype='int64'), 21: Int64Index([10], dtype='int64'), 23: Int64Index([11], dtype='int64'), 25: Int64Index([12], dtype='int64'), 27: Int64Index([13], dtype='int64'), 29: Int64Index([14], dtype='int64'), 31: Int64Index([15], dtype='int64'), 33: Int64Index([16], dtype='int64'), 35: Int64Index([17], dtype='int64'), 37: Int64Index([18], dtype='int64'), 39:
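The cell above lists the groups but does not call `agg()`. As a minimal sketch of an actual aggregation, the following reuses the World Cup data from the earlier groupby cells and summarizes the `Year` column per team:

```python
import pandas as pd

world_cup = {'Team': ['West Indies', 'West Indies', 'India', 'Australia', 'Pakistan',
                      'Sri Lanka', 'Australia', 'Australia', 'Australia', 'India', 'Australia'],
             'Rank': [7, 7, 2, 1, 6, 4, 1, 1, 1, 2, 1],
             'Year': [1975, 1979, 1983, 1987, 1992, 1996, 1999, 2003, 2007, 2011, 2015]}

df = pd.DataFrame(world_cup)
grouped = df.groupby('Team')

# A single aggregation returns a Series indexed by group name.
wins = grouped['Year'].agg('count')
print(wins)

# A list of aggregations returns a DataFrame with one column per function.
summary = grouped['Year'].agg(['min', 'max', 'count'])
print(summary)
```

`aggregate()` can be used in place of `agg()` with the same arguments.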

# Merge & Join

**Merging is the pandas operation that performs database-style joins on objects.**

pandas provides a single function, ***merge()***, as the entry point for all standard database join operations between DataFrame or named Series objects:

---
**Brief primer on merge methods (relational algebra)**

Experienced users of relational databases like SQL will be familiar with the terminology used to describe join operations between two SQL-table like structures (DataFrame objects). There are several cases to consider which are very important to understand:

*   **one-to-one joins:** for example when joining two DataFrame objects on their indexes (which must contain unique values).
*   **many-to-one joins:** for example when joining an index (unique) to one or more columns in a different DataFrame.
*   **many-to-many joins:** joining columns on columns.

---

**Note:** When joining columns on columns (potentially a many-to-many join), any indexes on the passed DataFrame objects will be discarded.

In [0]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
   

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
  
result = pd.merge(left, right, on='key')
print(result)

    A   B key   C   D
0  A0  B0  K0  C0  D0
1  A1  B1  K1  C1  D1
2  A2  B2  K2  C2  D2
3  A3  B3  K3  C3  D3
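The join types listed earlier can be checked at merge time with the `validate` argument of `merge()`, which raises an error when the keys do not have the stated relationship. This is a sketch with invented frames; `teams` and `wins` are illustrative names.

```python
import pandas as pd

teams = pd.DataFrame({'Team': ['India', 'Australia'], 'Rank': [2, 1]})
wins = pd.DataFrame({'Team': ['India', 'India', 'Australia'],
                     'Year': [1983, 2011, 1987]})

# Many-to-one: 'Team' may repeat in wins but must be unique in teams.
merged = pd.merge(wins, teams, on='Team', validate='many_to_one')
print(merged)

# One-to-one requires unique keys on both sides, so this merge is rejected
# because 'India' appears twice in wins.
try:
    pd.merge(wins, teams, on='Team', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

Without `validate`, an unexpected many-to-many relationship silently multiplies rows instead of raising.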


**Example2:**

---



In [0]:
import pandas as pd

campian_stats = {'Team':['India', 'Australia','West Indies', 'Pakistan', 'Sri Lanka'],
                  'Rank':[2,3,7,8,4],
                  'World_Champ_Year':[2011,2015,1979,1992,1996],
                  'Points':[874,787,753,673,855]}
match_stats = {'Team':['India', 'Australia','West Indies', 'Pakistan', 'Sri Lanka'],
               'World_Cup_Played':[11,10,11,9,8],
               'ODIs_Played':[733,988,712,679,662]}

df1 = pd.DataFrame(campian_stats)
df2 = pd.DataFrame(match_stats)

print(df1)
print(df2)

print('-------------------------------------------------------------------------------')
print(pd.merge(df1,df2, on = 'Team'))

   Points  Rank         Team  World_Champ_Year
0     874     2        India              2011
1     787     3    Australia              2015
2     753     7  West Indies              1979
3     673     8     Pakistan              1992
4     855     4    Sri Lanka              1996
   ODIs_Played         Team  World_Cup_Played
0          733        India                11
1          988    Australia                10
2          712  West Indies                11
3          679     Pakistan                 9
4          662    Sri Lanka                 8
-------------------------------------------------------------------------------
   Points  Rank         Team  World_Champ_Year  ODIs_Played  World_Cup_Played
0     874     2        India              2011          733                11
1     787     3    Australia              2015          988                10
2     753     7  West Indies              1979          712                11
3     673     8     Pakistan              1992    

**Left Join :**

---

A left join merges two objects based on the keys from the left object.

In [0]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
   

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
   
result = pd.merge(left, right, how='left', on=['key1', 'key2'])
print(result)


    A   B key1 key2    C    D
0  A0  B0   K0   K0   C0   D0
1  A1  B1   K0   K1  NaN  NaN
2  A2  B2   K1   K0   C1   D1
3  A2  B2   K1   K0   C2   D2
4  A3  B3   K2   K1  NaN  NaN


**Example 2: Left Join**

---



In [0]:
import pandas as pd

world_campians = {'Team':['India', 'Australia','West Indies', 'Pakistan', 'Sri Lanka'],
                  'Rank':[2,3,7,8,4],
                  'Year':[2011,2015,1979,1992,1996],
                  'Points':[874,787,753,673,855]}
chokers = {'Team':['South Africa','New Zealand', 'Zimbambwe'],
                  'Rank':[1,5,9],
                  'Points':[895,764,656]}

df1 = pd.DataFrame(world_campians)
df2 = pd.DataFrame(chokers)

print(df1)
print(df2)

print('----------------------------------------------------------------')

result = pd.merge(df1,df2,on='Team',how = 'left')
print(result)

   Points  Rank         Team  Year
0     874     2        India  2011
1     787     3    Australia  2015
2     753     7  West Indies  1979
3     673     8     Pakistan  1992
4     855     4    Sri Lanka  1996
   Points  Rank          Team
0     895     1  South Africa
1     764     5   New Zealand
2     656     9     Zimbambwe
----------------------------------------------------------------
   Points_x  Rank_x         Team  Year  Points_y  Rank_y
0       874       2        India  2011       NaN     NaN
1       787       3    Australia  2015       NaN     NaN
2       753       7  West Indies  1979       NaN     NaN
3       673       8     Pakistan  1992       NaN     NaN
4       855       4    Sri Lanka  1996       NaN     NaN
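
When both frames share non-key column names, as `Rank` and `Points` do above, `merge` disambiguates them with the default `_x` and `_y` suffixes. The `suffixes` parameter lets you pick more descriptive ones. A minimal sketch, using two small frames whose values are made up for illustration (they are not part of the data set above):

```python
import pandas as pd

# Illustrative frames (values are assumptions, chosen so that one key matches).
left = pd.DataFrame({'Team': ['India', 'Australia'],
                     'Rank': [2, 3],
                     'Points': [874, 787]})
right = pd.DataFrame({'Team': ['India', 'South Africa'],
                      'Rank': [4, 1],
                      'Points': [800, 895]})

# Overlapping non-key columns get the given suffixes instead of _x / _y.
result = pd.merge(left, right, on='Team', how='left',
                  suffixes=('_left', '_right'))
print(result)
```

The key column `Team` is never suffixed. Only the columns that appear in both frames and are not used as join keys are renamed.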


**Right Join:**

---

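Right Join merges two objects based on the keys from the right object: every row of the right object is kept, and the columns coming from the left object are filled with NaN where no matching key exists.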

In [0]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
   

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
   
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
print(result)

     A    B key1 key2   C   D
0   A0   B0   K0   K0  C0  D0
1   A2   B2   K1   K0  C1  D1
2   A2   B2   K1   K0  C2  D2
3  NaN  NaN   K2   K0  C3  D3


**Example 2: Right Join**

---



In [0]:
import pandas as pd

world_campians = {'Team':['India', 'Australia','West Indies', 'Pakistan', 'Sri Lanka'],
                  'Rank':[2,3,7,8,4],
                  'Year':[2011,2015,1979,1992,1996],
                  'Points':[874,787,753,673,855]}
chokers = {'Team':['South Africa','New Zealand', 'Zimbambwe'],
                  'Rank':[1,5,9],
                  'Points':[895,764,656]}

df1 = pd.DataFrame(world_campians)
df2 = pd.DataFrame(chokers)

print(df1)
print(df2)

print('----------------------------------------------------------------')

result = pd.merge(df1,df2,on='Team',how = 'right')
print(result)

   Points  Rank         Team  Year
0     874     2        India  2011
1     787     3    Australia  2015
2     753     7  West Indies  1979
3     673     8     Pakistan  1992
4     855     4    Sri Lanka  1996
   Points  Rank          Team
0     895     1  South Africa
1     764     5   New Zealand
2     656     9     Zimbambwe
----------------------------------------------------------------
   Points_x  Rank_x          Team  Year  Points_y  Rank_y
0       NaN     NaN  South Africa   NaN       895       1
1       NaN     NaN   New Zealand   NaN       764       5
2       NaN     NaN     Zimbambwe   NaN       656       9


**Outer Join :**

---

Outer Join merges two objects based on the union of the keys of both objects: rows from either side are kept, and missing values are filled with NaN.


In [0]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
   

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
   
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
print(result)

     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3


**Example 2: Outer Join**

---



In [0]:
import pandas as pd

world_campians = {'Team':['India', 'Australia','West Indies', 'Pakistan', 'Sri Lanka'],
                  'Rank':[2,3,7,8,4],
                  'Year':[2011,2015,1979,1992,1996],
                  'Points':[874,787,753,673,855]}
chokers = {'Team':['South Africa','New Zealand', 'Zimbambwe'],
                  'Rank':[1,5,9],
                  'Points':[895,764,656]}

df1 = pd.DataFrame(world_campians)
df2 = pd.DataFrame(chokers)

print(df1)
print(df2)

print('----------------------------------------------------------------')

result = pd.merge(df1,df2,on='Team',how = 'outer')
print(result)

   Points  Rank         Team  Year
0     874     2        India  2011
1     787     3    Australia  2015
2     753     7  West Indies  1979
3     673     8     Pakistan  1992
4     855     4    Sri Lanka  1996
   Points  Rank          Team
0     895     1  South Africa
1     764     5   New Zealand
2     656     9     Zimbambwe
----------------------------------------------------------------
   Points_x  Rank_x          Team    Year  Points_y  Rank_y
0     874.0     2.0         India  2011.0       NaN     NaN
1     787.0     3.0     Australia  2015.0       NaN     NaN
2     753.0     7.0   West Indies  1979.0       NaN     NaN
3     673.0     8.0      Pakistan  1992.0       NaN     NaN
4     855.0     4.0     Sri Lanka  1996.0       NaN     NaN
5       NaN     NaN  South Africa     NaN     895.0     1.0
6       NaN     NaN   New Zealand     NaN     764.0     5.0
7       NaN     NaN     Zimbambwe     NaN     656.0     9.0


**Inner Join**

---

Inner Join merges two objects based on the intersection of the keys of both objects, so only rows whose keys appear on both sides are kept:

In [0]:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                      'A': ['A0', 'A1', 'A2', 'A3'],
                      'B': ['B0', 'B1', 'B2', 'B3']})
   

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                       'C': ['C0', 'C1', 'C2', 'C3'],
                       'D': ['D0', 'D1', 'D2', 'D3']})
   
result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
print(result)

    A   B key1 key2   C   D
0  A0  B0   K0   K0  C0  D0
1  A2  B2   K1   K0  C1  D1
2  A2  B2   K1   K0  C2  D2


**Example 2: Inner Join**

---



In [0]:
import pandas as pd

world_campians = {'Team':['India', 'Australia','West Indies', 'Pakistan', 'Sri Lanka'],
                  'Rank':[2,3,7,8,4],
                  'Year':[2011,2015,1979,1992,1996],
                  'Points':[874,787,753,673,855]}
chokers = {'Team':['South Africa','New Zealand', 'Zimbambwe'],
                  'Rank':[1,5,9],
                  'Points':[895,764,656]}

df1 = pd.DataFrame(world_campians)
df2 = pd.DataFrame(chokers)

print(df1)
print(df2)

print('----------------------------------------------------------------')

result = pd.merge(df1,df2,on='Team',how = 'inner')
print(result)

   Points  Rank         Team  Year
0     874     2        India  2011
1     787     3    Australia  2015
2     753     7  West Indies  1979
3     673     8     Pakistan  1992
4     855     4    Sri Lanka  1996
   Points  Rank          Team
0     895     1  South Africa
1     764     5   New Zealand
2     656     9     Zimbambwe
----------------------------------------------------------------
Empty DataFrame
Columns: [Points_x, Rank_x, Team, Year, Points_y, Rank_y]
Index: []
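
To see where each row of a merged result came from, `merge` accepts `indicator=True`, which adds a `_merge` column holding `left_only`, `right_only`, or `both`. The sketch below compares the four join types side by side. It uses small hypothetical frames with a single `key` column (an assumption for brevity, not the frames used above):

```python
import pandas as pd

# Hypothetical frames: K1 and K2 appear on both sides, K0 only left, K3 only right.
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

row_counts = {}
for how in ['left', 'right', 'outer', 'inner']:
    merged = pd.merge(left, right, on='key', how=how, indicator=True)
    row_counts[how] = len(merged)
    print(how)
    print(merged)

# Keep the outer result to inspect the origin of every row.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
```

The outer join is the only one that can show all three `_merge` values at once, while the inner join only ever contains `both`.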


#Concatenation:

---

**Concatenation is the process of combining two or more data structures**

The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes. Note that I say “if any” because there is only a single possible axis of concatenation for Series.

Before diving into all of the details of concat and what it can do, here is a **simple example:**

In [0]:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                     index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                     index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                     index=[8, 9, 10, 11])

frames = [df1, df2, df3]

result = pd.concat(frames)
print(result)

      A    B    C    D
0    A0   B0   C0   D0
1    A1   B1   C1   D1
2    A2   B2   C2   D2
3    A3   B3   C3   D3
4    A4   B4   C4   D4
5    A5   B5   C5   D5
6    A6   B6   C6   D6
7    A7   B7   C7   D7
8    A8   B8   C8   D8
9    A9   B9   C9   D9
10  A10  B10  C10  D10
11  A11  B11  C11  D11


**Example 2:**

---



In [0]:
import pandas as pd

world_campians = {'Team':['India', 'Australia','West Indies', 'Pakistan', 'Sri Lanka'],
                  'Rank':[2,3,7,8,4],
                  'Year':[2011,2015,1979,1992,1996],
                  'Points':[874,787,753,673,855]}
chokers = {'Team':['South Africa','New Zealand', 'Zimbambwe'],
                  'Rank':[1,5,9],
                  'Points':[895,764,656]}

df1 = pd.DataFrame(world_campians)
df2 = pd.DataFrame(chokers)
print(pd.concat([df1,df2]))

   Points  Rank          Team    Year
0     874     2         India  2011.0
1     787     3     Australia  2015.0
2     753     7   West Indies  1979.0
3     673     8      Pakistan  1992.0
4     855     4     Sri Lanka  1996.0
0     895     1  South Africa     NaN
1     764     5   New Zealand     NaN
2     656     9     Zimbambwe     NaN
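
Two things stand out in the output above: the row labels 0, 1, 2 repeat, and `Year` is filled with NaN for the second frame because `concat` takes the union of the columns by default. The following sketch, on a shortened version of the same data, shows how `ignore_index=True` renumbers the rows and how `join='inner'` keeps only the columns common to all inputs:

```python
import pandas as pd

# Shortened version of the two frames used above.
df1 = pd.DataFrame({'Team': ['India', 'Australia'],
                    'Rank': [2, 3],
                    'Year': [2011, 2015]})
df2 = pd.DataFrame({'Team': ['South Africa', 'New Zealand'],
                    'Rank': [1, 5]})

# Default: union of columns (join='outer'); original row labels are kept.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([df1, df2], ignore_index=True)

# join='inner' keeps only the columns present in every input frame.
common = pd.concat([df1, df2], join='inner', ignore_index=True)

print(stacked)
print(renumbered)
print(common)
```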


#Descriptive statistics

---

pandas provides a large number of methods for computing descriptive statistics and other related operations on *Series*, *DataFrame*, and *Panel* (Panel exists only in older pandas versions). Most of these are aggregations (hence producing a lower-dimensional result) like **sum()**, **mean()**, and **quantile()**, but some of them, like **cumsum()** and **cumprod()**, produce an object of the same size. Generally speaking, these methods take an ***axis*** argument, just like *ndarray.{sum, std, …}*, but the axis can be specified by name or integer:

*   Series: no axis argument needed
*   DataFrame: “index” (axis=0, default), “columns” (axis=1)
*   Panel: “items” (axis=0), “major” (axis=1, default), “minor” (axis=2)

---

For example:

In [0]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

print(df)
print('----------------------------------------------------------------------')

print(df.mean(0))
print('----------------------------------------------------------------------')

print(df.mean(1))
print('----------------------------------------------------------------------')

print(df.sum(0, skipna=False))
print('----------------------------------------------------------------------')

print(df.sum(axis=1, skipna=True))
print('----------------------------------------------------------------------')

#Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, 
#like standardization (rendering data zero mean and standard deviation 1), very concisely:

ts_stand = (df - df.mean()) / df.std()
print(ts_stand.std())

print('----------------------------------------------------------------------')

xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)
print(xs_stand.std(1))

        one     three       two
a -1.097036       NaN -0.365077
b  1.485938 -1.334171  0.307014
c -1.848928 -0.590834 -0.077895
d       NaN  0.402767 -0.684203
----------------------------------------------------------------------
one     -0.486675
three   -0.507413
two     -0.205040
dtype: float64
----------------------------------------------------------------------
a   -0.731056
b    0.152927
c   -0.839219
d   -0.140718
dtype: float64
----------------------------------------------------------------------
one           NaN
three         NaN
two     -0.820161
dtype: float64
----------------------------------------------------------------------
a   -1.462113
b    0.458781
c   -2.517657
d   -0.281436
dtype: float64
----------------------------------------------------------------------
one      1.0
three    1.0
two      1.0
dtype: float64
----------------------------------------------------------------------
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
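
The example above passes the axis as an integer. As a small sketch of the "by name" form mentioned earlier, the same reductions can be written with `axis='index'` and `axis='columns'`. The frame below is a fixed-value stand-in (an assumption, so the results are reproducible without random numbers):

```python
import numpy as np
import pandas as pd

# Small fixed frame with one missing value.
df = pd.DataFrame({'one': [1.0, 2.0, np.nan],
                   'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

# axis=0 and axis='index' both reduce down the rows: one result per column.
col_means_by_int = df.mean(axis=0)
col_means_by_name = df.mean(axis='index')

# axis=1 and axis='columns' both reduce across the columns: one result per row.
row_sums_by_int = df.sum(axis=1)
row_sums_by_name = df.sum(axis='columns')

print(col_means_by_name)
print(row_sums_by_name)
```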


**Here is a quick reference summary table of common functions. Each also takes an optional level parameter which applies only if the object has a hierarchical index.**

---

| Function | Description |
| --- | --- |
| ***count*** | Number of non-NA observations |
| ***sum*** | Sum of values |
| ***mean*** | Mean of values |
| ***mad*** | Mean absolute deviation |
| ***median*** | Arithmetic median of values |
| ***min*** | Minimum |
| ***max*** | Maximum |
| ***mode*** | Mode |
| ***abs*** | Absolute value |
| ***prod*** | Product of values |
| ***std*** | Bessel-corrected sample standard deviation |
| ***var*** | Unbiased variance |
| ***sem*** | Standard error of the mean |
| ***skew*** | Sample skewness (3rd moment) |
| ***kurt*** | Sample kurtosis (4th moment) |
| ***quantile*** | Sample quantile (value at %) |
| ***cumsum*** | Cumulative sum |
| ***cumprod*** | Cumulative product |
| ***cummax*** | Cumulative maximum |
| ***cummin*** | Cumulative minimum |

---

**Note:** by chance, some NumPy functions, like mean, std, and sum, will exclude NAs by default when they are given a Series as input.
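
A minimal sketch of that behaviour, assuming a three-element Series with one missing value: `np.mean` applied to the Series hands the work to the Series' own `mean` method, which skips NaN, whereas the same function applied to the underlying NumPy array lets NaN propagate.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# On a Series, np.mean defers to Series.mean, which skips NaN by default.
series_mean = np.mean(s)

# On the raw ndarray, NaN propagates through the computation.
array_mean = np.mean(s.to_numpy())

# np.nanmean is NumPy's explicit NaN-skipping variant for plain arrays.
array_nanmean = np.nanmean(s.to_numpy())

print(series_mean, array_mean, array_nanmean)
```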

#Summarizing data: describe


---

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

In [0]:
series = pd.Series(np.random.randn(1000))

series[::2] = np.nan

print(series.describe())

count    500.000000
mean      -0.031346
std        0.952589
min       -2.760332
25%       -0.716598
50%       -0.071445
75%        0.568960
max        3.299940
dtype: float64


In [0]:
frame = pd.DataFrame(np.random.randn(1000, 5),
                    columns=['a', 'b', 'c', 'd', 'e'])
   
frame.iloc[::2] = np.nan

print(frame.describe())

                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean    -0.019253   -0.013860   -0.022647   -0.002475   -0.041288
std      1.031162    0.951880    1.016555    1.001741    1.019782
min     -3.297908   -3.483351   -3.183004   -2.887419   -2.788561
25%     -0.717033   -0.694187   -0.692025   -0.651974   -0.690513
50%     -0.036784   -0.023813   -0.042696    0.010926   -0.072090
75%      0.632111    0.613831    0.668130    0.630013    0.564886
max      3.972452    2.800126    2.910385    2.997069    4.394397


**For a non-numerical Series object,** **describe()** will give a simple summary of the number of unique values and most frequently occurring values:

In [0]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

print(s.describe())

count     9
unique    4
top       a
freq      5
dtype: object
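
`describe()` also takes a few options that the examples above do not use. The sketch below, on a small made-up frame (the column names `score` and `grade` are assumptions for illustration), shows `percentiles` for choosing which quantiles to report and `include='all'` for summarising numeric and non-numeric columns together:

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one non-numeric column.
frame = pd.DataFrame({'score': [1.0, 2.0, 3.0, 4.0, np.nan],
                      'grade': ['a', 'b', 'a', 'a', 'c']})

# By default only numeric columns are summarised; pick custom percentiles.
numeric_summary = frame.describe(percentiles=[0.05, 0.95])

# include='all' reports every column; statistics that do not apply are NaN.
full_summary = frame.describe(include='all')

print(numeric_summary)
print(full_summary)
```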


#Reindexing and altering labels

---
reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a particular axis. This accomplishes several things:

*   Reorders the existing data to match a new set of labels
*   Inserts missing value (NA) markers in label locations where no data for that label existed
*   If specified, fills data for missing labels using logic (highly relevant to working with time series data)

Here is a simple example:


In [0]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

print(s)

print('---------------------------------------------------------------')
print(s.reindex(['e', 'b', 'f', 'd']))

a   -1.448233
b   -1.002762
c    1.266810
d   -1.890333
e   -0.102553
dtype: float64
---------------------------------------------------------------
e   -0.102553
b   -1.002762
f         NaN
d   -1.890333
dtype: float64


Here, the f label was not contained in the Series and hence appears as NaN in the result.
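
The third point in the list above, filling data for missing labels, can be sketched with the `method` and `fill_value` arguments of `reindex()`. The Series below uses fixed values and an integer index (assumptions made so the result is easy to follow):

```python
import pandas as pd

# Values exist only at labels 0, 2 and 4.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

# Plain reindex: labels 1, 3 and 5 have no data and become NaN.
plain = s.reindex(range(6))

# method='ffill' carries the last valid observation forward to new labels.
forward = s.reindex(range(6), method='ffill')

# fill_value substitutes a constant instead of NaN.
constant = s.reindex(range(6), fill_value=0.0)

print(plain)
print(forward)
print(constant)
```

Note that `method='ffill'` requires the original index to be sorted (monotonic), which is usually the case for time series.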

---

With a DataFrame, you can simultaneously reindex the index and columns:

In [0]:
print(df)

print('---------------------------------------------------------------------------')
print(df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one']))

        one     three       two
a -1.097036       NaN -0.365077
b  1.485938 -1.334171  0.307014
c -1.848928 -0.590834 -0.077895
d       NaN  0.402767 -0.684203
---------------------------------------------------------------------------
      three       two       one
c -0.590834 -0.077895 -1.848928
f       NaN       NaN       NaN
b -1.334171  0.307014  1.485938


#Renaming / mapping labels

---



The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

---



In [0]:
print(s)

print('-----------------------------------------------------------------')
print(s.rename(str.upper))

a   -1.448233
b   -1.002762
c    1.266810
d   -1.890333
e   -0.102553
dtype: float64
-----------------------------------------------------------------
A   -1.448233
B   -1.002762
C    1.266810
D   -1.890333
E   -0.102553
dtype: float64


In [0]:
print(df)

print('----------------------------------------------------------------')

print(df.rename({'one': 'foo', 'two': 'bar'}, axis='columns'))

        one     three       two
a -1.097036       NaN -0.365077
b  1.485938 -1.334171  0.307014
c -1.848928 -0.590834 -0.077895
d       NaN  0.402767 -0.684203
----------------------------------------------------------------
        foo     three       bar
a -1.097036       NaN -0.365077
b  1.485938 -1.334171  0.307014
c -1.848928 -0.590834 -0.077895
d       NaN  0.402767 -0.684203


#Function application

---

To apply your own or another library’s functions to pandas objects, you should be aware of the four methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

1. Tablewise Function Application: ***pipe()***
2. Row or Column-wise Function Application: ***apply()***
3. Aggregation API: ***agg()*** and ***transform()***
4. Applying Elementwise Functions: ***applymap()***
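
As a quick orientation, the four approaches listed above can be sketched on one small frame. The helper `add_total` and the column names are hypothetical, chosen only for illustration. Recent pandas releases rename `applymap()` to `DataFrame.map()`, so the sketch picks whichever of the two is available:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 30.0]})

def add_total(frame):
    """Hypothetical table-wise helper: return a copy with a row-sum column."""
    return frame.assign(total=frame.sum(axis=1))

# 1. Table-wise: pipe() passes the whole DataFrame to the function.
piped = df.pipe(add_total)

# 2. Column-wise (default axis=0): apply() calls the function once per column.
col_range = df.apply(lambda col: col.max() - col.min())

# 3. Aggregation API: agg() reduces each column, transform() keeps the shape.
summary = df.agg(['min', 'max'])
centered = df.transform(lambda col: col - col.mean())

# 4. Element-wise: applymap() in older pandas, DataFrame.map() in newer ones.
elementwise = df.map if hasattr(df, 'map') else df.applymap
doubled = elementwise(lambda value: value * 2)

print(piped)
print(col_range)
print(summary)
print(centered)
print(doubled)
```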