## Python Data Analysis Library

### pandas
pandas is an open source, *BSD-licensed* library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.



1.   High-performance, easy-to-use open source library for data analysis
2.   Creates tabular data from different sources such as CSV, JSON, and databases
3.   Has utilities for descriptive statistics, aggregation, and handling missing data
4.   Provides database-style operations such as merge and join
5.   Fast, programmable, and easy alternative to spreadsheets




https://pandas.pydata.org/

![pandas logo](https://pandas.pydata.org/_static/pandas_logo.png)

### Highlights Of Pandas

A fast and efficient DataFrame object for data manipulation with integrated indexing

Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format

Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form

Flexible reshaping and pivoting of data sets

Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

Columns can be inserted and deleted from data structures for size mutability

Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets

High performance merging and joining of data sets

Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure

Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data

Highly optimized for performance, with critical code paths written in Cython or C.

Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
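
The split-apply-combine idea behind the group by engine mentioned in the highlights can be illustrated with a minimal sketch. The `sales` table, its column names, and its values are made up for this example and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical sales records, made up for illustration.
sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south', 'north'],
    'units': [10, 4, 6, 8, 2],
})

# Split the rows by region, apply sum and mean to each group,
# then combine the results into one summary table.
summary = sales.groupby('region')['units'].agg(['sum', 'mean'])
print(summary)
```

Each distinct value of `region` becomes one row of `summary`, with one column per aggregation.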




### Installation 

**Anaconda**

conda install pandas


**PyPI**

pip install pandas



At the time of writing, the latest version was *v0.24.2*, released in March 2019.

**Importing Pandas**

In [1]:
import pandas as pd

### Understanding Series & DataFrames
A Series represents a single column of data.

Multiple columns combine to form a table (i.e., a DataFrame).

In [2]:
series_01 = pd.Series(data=[1,2,3,4,5], index=['a','b','c','d','e'])
series_01

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [3]:
series_02 = pd.Series(data=[9,8,7,6,5], index=list('abcde'))
series_02

a    9
b    8
c    7
d    6
e    5
dtype: int64

In [4]:
df = pd.DataFrame({'A':series_01, 'B':series_02})
print(type(df.A))

<class 'pandas.core.series.Series'>


In [5]:
import numpy as np
# Create a 10 x 10 DataFrame of random integers between 1 and 9
dataframe_01 = pd.DataFrame(
    data=np.random.randint(1, 10, size=(10, 10)),
    index=list('ABCDEFGHIJ'),
    columns=list('abcdefghij'),
)
# Optional experiments, left commented out:
# print(type(dataframe_01.a))
# b = dataframe_01.groupby(['a']).b.count()
# type(b)

### Loading CSV

In [6]:
# files.upload() is available only inside Google Colab; elsewhere this import fails
from google.colab import files
files.upload()

ModuleNotFoundError: No module named 'google'

In [None]:
iris_data = pd.read_csv('iris.csv')
print(type(iris_data))

In [None]:
iris_data.head(10)

In [None]:
iris_data.tail(7)

In [None]:
iris_data.describe()

In [None]:
iris_data.info()

In [None]:
iris_data.petal_width.value_counts()

In [None]:
iris_data.sepal_width

In [None]:
iris_data.sepal_width.head()

In [None]:
iris_data.petal_width.tail()
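
Because the cells above depend on an uploaded `iris.csv`, here is a small self-contained sketch of the same `read_csv` workflow that runs without any file. The rows are made up, and the `sepal_length`, `petal_length`, and `species` column names are assumptions; only `sepal_width` and `petal_width` appear in the cells above.

```python
import io
import pandas as pd

# A few made-up rows in an iris-style layout (column names partly assumed).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts any file-like object, not only a path on disk.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                        # (rows, columns)
print(sample.dtypes)                       # numeric columns are parsed as float64
print(sample.petal_width.value_counts())   # how often each value occurs
```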

### Loading JSON

In [None]:
files.upload()

In [None]:
movie_data = pd.read_json('movie.json.txt')
movie_data

In [None]:
movie_data.describe()
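
The layout `read_json` expects by default can be sketched with an in-memory document. The reviewer and movie names are borrowed from later cells in this notebook, but the ratings and the exact structure of `movie.json.txt` are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical ratings: outer keys become columns, inner keys become the row index.
json_text = """
{
  "Adam Cohen":  {"Scarface": 4.0, "Goodfellas": 3.0},
  "David Smith": {"Scarface": 5.0, "Goodfellas": null}
}
"""

ratings = pd.read_json(io.StringIO(json_text))
print(ratings)

# A JSON null is loaded as NaN, the pandas marker for missing data.
print(ratings.loc['Goodfellas'])
```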

### Accessing subset of data - rows, columns, filters

In [None]:
files.upload()

In [None]:
hr_data = pd.read_csv('HR_comma_sep.csv.txt')
hr_data.head(5)

In [None]:
hr_data.head()

In [None]:
hr_data.columns

In [None]:
hr_data.info()

In [None]:
# Keep only the float64 columns
float_cols_data = hr_data.select_dtypes('float64')
float_cols_data.head()

In [None]:
hr_data[['satisfaction_level','last_evaluation','number_project']].head(10)

In [None]:
hr_data.satisfaction_level[:5]

In [None]:
hr_data[['satisfaction_level', 'last_evaluation']][50:100]

In [None]:
movie_data

In [None]:
movie_data.loc['Scarface']

In [None]:
movie_data.loc['Goodfellas']

In [None]:
movie_data.iloc[1:4]

In [None]:
movie_data[1:4]

In [None]:
movie_data[ (movie_data['Adam Cohen'] <= 3)]

In [None]:
movie_data[ ((movie_data['Adam Cohen'] > 3) & (movie_data['David Smith'] > 4))]
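
The difference between label-based selection (`loc`), position-based selection (`iloc`), and boolean filters used in the cells above can be summarized in one runnable sketch. The table below is made up; the titles 'Vertigo' and 'Alien' and all ratings are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ratings table, indexed by movie title.
ratings = pd.DataFrame(
    {'Adam Cohen': [4, 3, 2, 5], 'David Smith': [5, 4, 1, 5]},
    index=['Scarface', 'Goodfellas', 'Vertigo', 'Alien'],
)

by_label = ratings.loc['Goodfellas']                # one row, selected by index label
by_position = ratings.iloc[1:3]                     # positions 1 and 2 (end excluded)
label_slice = ratings.loc['Goodfellas':'Vertigo']   # label slices include both ends

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
both_high = ratings[(ratings['Adam Cohen'] > 3) & (ratings['David Smith'] > 4)]
print(both_high)
```

Note that a positional slice excludes its end point while a label slice includes it, which is why `by_position` and `label_slice` return the same two rows here.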

### Handling missing data


*   Most machine learning algorithms cannot handle missing data

*   If a column has more than 40% of its data missing, we may drop the column

*   If a row is missing values in important columns, drop the row
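
The rules of thumb above can be sketched as follows. The table, its column names, and the choice of `score` as the "important" column are assumptions made for illustration; the 40% threshold comes from the list above.

```python
import numpy as np
import pandas as pd

# Made-up table: 'mostly_empty' is missing in 3 of 5 rows (60%).
df_missing = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'score': [0.5, np.nan, 0.7, 0.9, 0.4],
    'mostly_empty': [np.nan, np.nan, 1.0, np.nan, 2.0],
})

# Fraction of missing values per column.
missing_share = df_missing.isnull().mean()

# Drop columns where more than 40% of the values are missing.
cols_to_drop = missing_share[missing_share > 0.4].index
trimmed = df_missing.drop(columns=cols_to_drop)

# Drop rows where an important column (here 'score') is missing.
cleaned = trimmed.dropna(subset=['score'])
print(missing_share)
print(cleaned)
```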



In [None]:
# Boolean mask: True for rows where column 'Bill Duffy' has a value (is not missing)
movie_data['Bill Duffy'].notnull()

In [None]:
movie_data[movie_data['Bill Duffy'].notnull()]

In [None]:
movie_data[movie_data['Bill Duffy'].isnull()]

### Dropping Rows & Columns

In [None]:
files.upload()

In [None]:
titanic_data = pd.read_csv('titanic-train.csv.txt')
titanic_data.head()

In [None]:
titanic_data.info()

In [None]:
titanic_data.drop(['Cabin'],axis=1,inplace=True)

In [None]:
titanic_data.head()

In [None]:
titanic_data.info()

Let's drop all rows with missing values.

Because we do not pass `inplace=True`, `dropna()` returns a new DataFrame and leaves the original unmodified.

In [None]:
titanic_data.dropna().info()
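
That `dropna()` leaves the original DataFrame untouched unless the result is kept can be checked with a small sketch. The `people` table is made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing age.
people = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'], 'age': [31.0, np.nan, 45.0]})

without_gaps = people.dropna()   # a new DataFrame; 'people' is untouched
print(len(people), len(without_gaps))

# To keep the result, assign it back to a name.
people = people.dropna()
print(len(people))
```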

### Function Application
`map` transforms each value of one column to produce another column.

It is a Series method, so it is called on a single column rather than on a whole DataFrame.
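
As a minimal sketch of how `map` behaves, the example below maps a Series first with a function and then with a dictionary. The ages are made up; the 'Kid'/'Adult' labels follow the cells in this section, and `na_action='ignore'` is used here to show how missing values can be kept as NaN.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value.
ages = pd.Series([5.0, 40.0, np.nan, 17.0])

# Map with a function; na_action='ignore' leaves missing values as NaN
# instead of passing them to the function.
categories = ages.map(lambda age: 'Kid' if age < 18 else 'Adult', na_action='ignore')

# Map with a dictionary; values without a matching key become NaN.
codes = categories.map({'Kid': 0, 'Adult': 1})
print(categories.tolist())
print(codes.tolist())
```

Without `na_action='ignore'`, the missing age would be labelled 'Adult', because the comparison `NaN < 18` evaluates to False.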

In [None]:
titanic_data_age = titanic_data[titanic_data.Age.notnull()]

In [None]:
titanic_data_age.info()

In [None]:
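# na_action='ignore' keeps missing ages as NaN instead of labelling them 'Adult'
titanic_data['age_category'] = titanic_data.Age.map(lambda age: 'Kid' if age < 18 else 'Adult', na_action='ignore')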

In [None]:
titanic_data.head(25)

In [None]:
# A function name passed as a string calls that Series method; same as titanic_data.Age.sum()
titanic_data.Age.apply('sum')

In [None]:
titanic_data_age.Age.apply('sum')

In [None]:
titanic_data_age.Age.apply(lambda age: 'Kid' if age < 18 else 'Adult')[:100]


In [None]:
titanic_data_age.Age.map(lambda age: 'Kid' if age < 18 else 'Adult')[:10]

In [None]:
# regex=False matches the literal text 'Mrs.'; by default '.' is a regex wildcard
titanic_data[titanic_data.Name.str.contains('Mrs.', regex=False)].head()

### Append, Merge, Join & Concatenate
Appending stacks the rows of one DataFrame below those of another.

In [None]:
df1 = pd.DataFrame(data=np.random.randint(1,10,size=(10,3)), columns=list('ABC'))
df1

In [None]:
df2 = pd.DataFrame(data=np.random.randint(1,10,size=(10,3)), columns=list('ABC'))
df2

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
pd.concat([df1, df2], ignore_index=True)
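
The cells above only cover stacking rows, so here is a hedged sketch of the other operations named in this section's heading: `merge`, `join`, and `concat`. The `employees` and `salaries` tables and the `emp_id` key are hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical tables sharing an 'emp_id' key.
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cy']})
salaries = pd.DataFrame({'emp_id': [1, 2, 4], 'salary': [50, 60, 70]})

# concat: stack the rows of frames that share the same columns.
stacked = pd.concat([employees, employees], ignore_index=True)

# merge: SQL-style join on a column; 'inner' keeps keys present in both tables.
inner = pd.merge(employees, salaries, on='emp_id', how='inner')

# join: aligns on the index, so the key is set as the index first.
left = employees.set_index('emp_id').join(salaries.set_index('emp_id'), how='left')
print(inner)
print(left)
```

In the left join, employee 3 has no matching salary row, so its `salary` is NaN.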

In [None]:
import pandas as pd
# Select rows with a boolean mask: rows 0 and 2 are kept, row 1 is dropped
df = pd.DataFrame({'a': [1, 2, 2], 'b': [3, 5, 2]})
mask = [True, False, True]
print(df)
df[mask]

This marks the end of the pandas tutorial.

Remember, practice is the key to achieving proficiency.