# Introduction
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. pandas is well suited for many different kinds of data:

Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet

Ordered and unordered (not necessarily fixed-frequency) time series data.

Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels

Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.
pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Here are just a few of the things that pandas does well:

Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data

Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations

Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

Intuitive merging and joining data sets

Flexible reshaping and pivoting of data sets

Hierarchical labeling of axes (possible to have multiple labels per tick)

Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format

Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging.

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks.

In [None]:
! pip install pandas

In [None]:
import pandas as pd       # import pandas library in notebook

In [None]:
pd.__version__            # check the installed pandas version

# Creation of Pandas Series
A pandas Series can be created from a list, tuple, dictionary or NumPy array. To create a Series, use pandas.Series().

    Series - one-dimensional data type in pandas
            Labeled values in a Series
                Header / column name
                We can give the name explicitly using the name parameter
            Index
                By default we get integer indexes from 0 to len-1
                However, we can pass the index explicitly using the index parameter

            If we use a list, tuple or array, the number of values in the iterable must equal the number of labels in the index parameter
            If we use a scalar value (int, str, bool, float), the Series takes its length from the index and every entry is filled with the scalar
            If we use a dictionary, the values are looked up by the labels in the index parameter; labels missing from the dictionary get NaN
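
The rules above can be tried out in a short sketch. The values, labels, and variable names here are made up for illustration.

```python
import pandas as pd

# list input: one index label per value, plus an explicit name
s_list = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='demo')

# scalar input: the scalar is repeated once per index label
s_scalar = pd.Series(7, index=['x', 'y'])

# dictionary input: values are looked up by index label,
# so 'c' (not a key of the dictionary) becomes NaN and 'a' is left out
s_dict = pd.Series({'a': 1, 'b': 2}, index=['b', 'c'])

# list input with the wrong number of index labels raises ValueError
mismatch_error = None
try:
    pd.Series([1, 2, 3], index=['a', 'b'])
except ValueError as exc:
    mismatch_error = str(exc)

print(s_list, s_scalar, s_dict, mismatch_error, sep='\n\n')
```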

In [None]:
pd.Series(1.0,index=[100,20])

In [None]:
pd.Series(('Aditya','Rishi','Vineet'),index=['A','R','V'],name='DA3').dtype

In [None]:
population_dict = {
    'California':38956785,
    'Texas':26441568,
    'New York':19555647,
    'Florida':12364569,
    'Illinois':12882135
}

population_dict

In [None]:
type(population_dict)

In [None]:
population = pd.Series(population_dict)
population

In [None]:
population['California':'Florida']

In [None]:
population.ndim

In [None]:
population.shape

In [None]:
population.size

In [None]:
population.dtype

In [None]:
population['California']

In [None]:
population.index

In [None]:
population['California']

In [None]:
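population.iloc[0]   # positional access; population[0] on a string-labeled index is deprecated in recent pandas versions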

In [None]:
population

In [None]:
population['California':'New York']
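
Label slices like the one above include both endpoints, while positional slices with .iloc exclude the stop position. A minimal sketch of the difference, rebuilding the same population Series so that it runs on its own:

```python
import pandas as pd

population = pd.Series({
    'California': 38956785,
    'Texas': 26441568,
    'New York': 19555647,
    'Florida': 12364569,
    'Illinois': 12882135,
})

by_label = population.loc['California':'New York']   # label slice: 'New York' is included
by_position = population.iloc[0:2]                   # positional slice: position 2 is excluded
first_value = population.iloc[0]                     # explicit positional access

print(len(by_label), len(by_position), first_value)
```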

In [None]:
lst=[2,4,6,8]


In [None]:
pd.Series([2,4,6,8])  # creating series with list


# the default index is a RangeIndex running from 0 to len(data) - 1

In [None]:
pd.Series((1,2,3,4))       # creating series with tuple

In [None]:
import numpy as np

In [None]:
x=np.array([9,10,11,12,13])
pd.Series(x)                   # creating series with numpy array

In [None]:
# create a Series from a NumPy array of random numbers in the cell below

In [None]:
#answer here

In [None]:
pd.Series( [2,4,6] , index=[100,200,'delhi'] )               # setting a custom index (labels may mix integers and strings)

In [None]:
pd.Series( 5 )                          # creating a Series from a scalar value

In [None]:
pd.Series( 5 , index=[100,200,300] )


# the index determines the length: the scalar is repeated for every index label

In [None]:
pd.Series( {2:'a' ,  1:'b', 3:'c'} )

In [None]:
pd.Series( {2:'a' ,  1:'b', 3:'c'}, index=[3,2,4,5] )
# index labels that are not keys of the dictionary (4 and 5 here) get NaN; keys that are not in the index (1 here) are left out

# Type and Shape of the series
To check the type of the Series use type() and to check the shape of the series use .shape.

In [None]:
population

In [None]:
type(population)

In [None]:
population.shape # Series

# Data Frame
Creating Pandas DataFrame
A pandas DataFrame can be created using pandas Series, dictionary, list, tuple and numpy array.

To create pandas DataFrame use pd.DataFrame()

    pd.DataFrame(<data>)
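
Besides the inputs demonstrated in the cells that follow, a dictionary that maps column names to values is a common way to build a DataFrame. The sketch below uses made-up data; the names scores, left, right, and aligned are only for illustration.

```python
import pandas as pd

# dictionary of equal-length lists: each key becomes a column
scores = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [80, 90, 95]})

# dictionary of Series: rows are aligned on the index labels,
# and labels missing from one Series produce NaN in its column
left = pd.Series({'x': 1, 'y': 2})
right = pd.Series({'y': 20, 'z': 30})
aligned = pd.DataFrame({'left': left, 'right': right})

print(scores.shape)
print(aligned)
```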

In [None]:
pd.DataFrame(pd.Series([i for i in range(1,20,2)],name='Odd_numbers'))

In [None]:
population_dict = {
    'California':38956785,
    'Texas':26441568,
    'New York':19555647,
    'Florida':12364569,
    'Illinois':12882135
}

In [None]:
population_series=pd.Series(population_dict)

In [None]:
pd.DataFrame(population_series)

In [None]:
population_dict = {
    'California':38956785,
    'Texas':26441568,
    'New York':19555647,
    'Florida':12364569,
    'Illinois':12882135
}

population=pd.Series(population_dict)
df = pd.DataFrame(population,columns=['population'])
df

In [None]:
df.shape

In [None]:
df.size

In [None]:
df.ndim

In [None]:
# create a DataFrame from this dictionary in the cell below
city_networth = {
    'California':126789,
    'Texas':939393,
    'New York':348030340,
    'Florida':4384320,
    'Illinois':98540340
}

In [None]:
#answer here

In [None]:
df.shape # DataFrame shape will give you the information similar to 2D numpy array with rows and columns

In [None]:
df = pd.DataFrame(population, columns=['Population'])  # changing the name of the column while creating the DataFrame
df

In [None]:
lst=[
    [1,2,3],
    ['a','b','c'],
    [True,False]
]

In [None]:
[1,2,3]

In [None]:
['a','b','c']

In [None]:
pd.DataFrame(lst)

In [None]:
# Creating a pandas DataFrame from a list of dictionaries
data = [ {'a':i, 'b':i} for i in range(1,4) ]
data

In [None]:
df = pd.DataFrame(data)
df

In [None]:
data=[{'a': 80, 'b': 1}, {'a': 90, 'b': 2}, {'a': 95, 'b': 3}]

In [None]:
data=[
    [80,1],
    [90,2],
    [95,3]
]

In [None]:
# a customized index can be setup for a smaller DataFrame
df = pd.DataFrame(data, index=['Aditya','Vijay','Vineet'])
df

In [None]:
df.index

In [None]:
df.columns

In [None]:
# to check the name of the columns use df.columns
df.columns

In [None]:
df.columns[0]

In [None]:
# to change the names of the columns use 'df.columns=list of the new column names'
df.columns=['Excel','Rank']
df

In [None]:
# to change the index after creating the DataFrame, use 'df.index=list of the new index values'
df.index=['i','ii','iii']
df

In [None]:
df.columns=['column1','column2']
df

In [None]:
print(df.columns)

In [None]:
print(df.columns[0])  # to get the name of a column by its position, use df.columns[position]

In [None]:
print(list(df.columns))     # to get the list of the column names 

In [None]:
df.index   # to check the index values

In [None]:
print(list(df.index))    # to get the list of the index values

In [None]:
df

In [None]:
# if a DataFrame is created from a list of dictionaries whose keys do not all match, the missing entries are filled with NaN

pd.DataFrame( [ {'a':1,'b':2} , {'b':3,'c':4} ] )  

In [None]:
# To create a DataFrame with numpy array
import numpy as np
pd.DataFrame(np.random.rand(3,2), columns=['Class1','Class2'], index=['a','b','c'])

In [None]:
pd.DataFrame(np.random.rand(3,2))

In [None]:
# List of Dictionaries

d = [ 
          {'city':'Delhi','data':1000}, 
          {'city':'Mumbai','data':2000}, 
          {'city':'Bangalore','data':1500} 
     ]
pd.DataFrame(d,index=[1,2,3])

# Importing Data to create DataFrame
A pandas DataFrame can also be created by importing data from CSV or Excel files, or from a relational database such as MySQL.

To create DataFrame from csv file use following syntax:

pandas.read_csv(filepath_or_buffer, names=<no_default>, index_col=None, skipinitialspace=False, skiprows=None, skipfooter=0, na_values=None)

Only a few of the available parameters are listed here. The squeeze parameter accepted by older versions was removed in pandas 2.0. If the data is already well arranged and clean, the file path alone is enough:

pandas.read_csv(file_path)
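
To see what a few of these parameters do without needing a file on disk, the sketch below reads CSV text from an in-memory buffer. The data, the 'missing' marker, and the variable names are made up for illustration.

```python
import io
import pandas as pd

csv_text = (
    "exported by a hypothetical tool\n"
    "name,score,grade\n"
    "Ann,90,A\n"
    "Bob,missing,B\n"
)

df_csv = pd.read_csv(
    io.StringIO(csv_text),   # any file-like object works in place of a path
    skiprows=1,              # skip the first line, which is not part of the table
    index_col='name',        # use the 'name' column as the row index
    na_values=['missing'],   # treat this marker as NaN in addition to the defaults
)
print(df_csv)
```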

In [None]:
# csv files
import pandas as pd
path='https://raw.githubusercontent.com/ubaid-shah/datasets/main/student_records.csv'
df = pd.read_csv(path)
df

In [None]:
import pandas as pd
df=pd.read_csv(r'A:\OneDrive\work\acciojob\python\accio\bigmart_data.csv') # absolute path
pd.read_csv('bigmart_data.csv') # relative path

# Data Exploration
Data exploration helps you find out what a dataset contains. pandas offers several ways to do it.

Viewing/Inspecting Data
Use these commands to take a look at specific sections of your pandas DataFrame or Series.

| Command | Description |
|---|---|
| df.head(n) | First n rows of the DataFrame (the first 5 rows by default) |
| df.tail(n) | Last n rows of the DataFrame (the last 5 rows by default) |
| df.shape | Number of rows and columns, as a tuple |
| df.size | Number of elements: the number of rows for a Series, rows times columns for a DataFrame |
| df.info() | Index, data types and memory information |
| df.describe() | Summary statistics for numerical columns |
| df[col].value_counts(dropna=False) | Unique values of one column and their counts, including NaN; called on a whole DataFrame, value_counts() counts unique rows instead |
| df.apply(pd.Series.value_counts) | Unique values and counts for all columns |
| df.ndim | Number of dimensions: 1 for a Series, 2 for a DataFrame |
| df.sample(n) | n randomly selected rows (1 by default); pass axis=1 to sample columns instead. Useful for drawing a random sample from a Series or DataFrame |
| df.isna() or df.isnull() | DataFrame of booleans with the same shape, where True marks a missing value |
| df.isnull().sum() | Number of missing values in each column |
| df.dropna() | Drop the rows that contain NaN or missing values; pass axis=1 to drop columns instead |
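
A few of these commands, applied to a tiny DataFrame so the results are easy to check by hand. The data and the name df_demo are made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'score': [1.0, np.nan, 3.0, 4.0],
})

first_two = df_demo.head(2)            # first 2 rows
n_rows, n_cols = df_demo.shape         # (rows, columns)
n_cells = df_demo.size                 # rows * columns
missing = df_demo.isnull().sum()       # missing values per column
complete = df_demo.dropna()            # only the rows without missing values

print(n_rows, n_cols, n_cells)
print(missing)
```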



# Basic Data Exploration

In [None]:
df

In [None]:
df.head() # 5 rows

In [None]:
df.head(3)

In [None]:
df.head(1)

In [None]:
df.tail() # last 5 rows

In [None]:
df.tail(2)

In [None]:
df

In [None]:
len(df.columns)

In [None]:
df.shape

In [None]:
df['ProjectScore']

In [None]:
df.ProjectScore

In [None]:
ser = df.ProjectScore  # create a series from one column from the dataset
print(ser)
print(type(ser))
print(ser.shape)

In [None]:
df['ProjectScore']               # create a series from one column from the dataset

In [None]:
df.ProjectScore

In [None]:
df.shape   # no of rows and columns of the DataFrame

In [None]:
df.ndim    # Returns dimension of dataframe/series. 1 for one dimension (series), 2 for two dimensions (dataframe).


In [None]:
df.sample()                  # return one randomly chosen row

In [None]:
df.sample(3)                  # return three randomly chosen rows

In [None]:
df.columns         # columns of the DataFrame

In [None]:
list(df.columns)        # create a list of the column names

In [None]:
df.info()                       # Index, Datatype and Memory information

# Statistical Summary of Pandas DataFrame

In [None]:
# Summary of the numeric columns
df.describe()

The output above reports, for each numeric column, the number of non-null records (count), the mean, the standard deviation (std), and the minimum and maximum. It also gives the 25%, 50% and 75% quantiles; the 50% quantile is the median.
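
As a quick check of how to read this output, the sketch below runs describe() on the ten ResearchScore values that appear in the sorted outputs later in this notebook, and compares the 50% entry with median(). The variable names are only for illustration.

```python
import pandas as pd

scores = pd.Series([10, 20, 25, 25, 60, 75, 75, 85, 90, 92], name='ResearchScore')

summary = scores.describe()      # count, mean, std, min, 25%, 50%, 75%, max
median = scores.median()         # should match the 50% entry of the summary

print(summary)
print(median)
```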

In [None]:
df.describe().T   # transpose the describe DataFrame

In [None]:
df['ResearchScore'].describe()          # to get statistical information for specific column

In [None]:
df['ResearchScore'].quantile(0.5)   # to get specific quantile values  50% quantile is median of the data

In [None]:
df['ResearchScore'].quantile(0.25)

In [None]:
round(df[['ResearchScore','ProjectScore']].quantile(0.75),2)   # round the 75% quantile to 2 decimal places

In [None]:
round(df[['ResearchScore','ProjectScore']].describe(),2)


In [None]:
round(df[['ResearchScore','ProjectScore']].std(),2)

In [None]:
df.describe()

In [None]:
round(df.describe(),2)   # to round off the total DataFrame values to 2 decimal points

In [None]:
df['ResearchScore'].mean()

In [None]:
df['ResearchScore'].count()

In [None]:
# to get the names of the Numeric columns
df.describe().columns

In [None]:
df.info()

In [None]:
# df.drop(columns, axis=1) returns a new DataFrame without the given columns; df itself is unchanged
x = df.drop('Name',axis=1) 
x

In [None]:
set(x.columns)       # to create the set of the columns

In [None]:
df.describe()

In [None]:
set(x.describe().columns)

In [None]:
set(x.columns)-set(x.describe().columns)     # to get the names of the categorical columns

In [None]:
cat_col = list(set(x.columns) - set(x.describe().columns))
cat_col

In [None]:
x[cat_col] # call only the specific data according to the column name

In [None]:
df[cat_col].describe()   # statistical info of the categorical columns

In [None]:
x.describe(include='object')     # statistical info of the categorical (object) columns

In [None]:
x.describe(include='all')       # statistical info of the numerical as well as categorical columns
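
An alternative to the set-difference approach above is select_dtypes(), which picks columns by data type directly. A minimal sketch on made-up data shaped like the student records; df_demo, num_cols, and cat_cols are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Henry', 'John'],
    'ResearchScore': [90, 85],
    'Recommend': ['Yes', 'Yes'],
})

num_cols = df_demo.select_dtypes(include='number').columns.tolist()   # numeric columns
cat_cols = df_demo.select_dtypes(exclude='number').columns.tolist()   # everything else

print(num_cols, cat_cols)
```

Unlike a set difference, this keeps the columns in their original order.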

# Value counts

In [None]:
df['Recommend'].value_counts()    # View unique values and counts

In [None]:
df['Recommend'].value_counts(normalize=True)  # View unique values and counts in fraction between 0 to 1.

In [None]:
df['Recommend'].value_counts(normalize=True) *100 # View unique values and counts in percentage

In [None]:
df["OverallGrade"].value_counts() 

The above value_counts will not include the NaN values. To consider the NaN values use `.value_counts(dropna=False)`

In [None]:
df["OverallGrade"].value_counts(dropna=False) 

In [None]:
df["OverallGrade"].value_counts(dropna=False,normalize=True) 

In [None]:
df["OverallGrade"].value_counts(dropna=False,normalize=True) *100

In [None]:
df.value_counts(dropna=False)

In [None]:
df.apply(pd.Series.value_counts) # Unique values and counts for all columns
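
The effect of dropna and normalize is easiest to see on a short Series. A minimal sketch with made-up grades:

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A', None], name='OverallGrade')

counts = grades.value_counts()                       # the missing value is left out
counts_with_na = grades.value_counts(dropna=False)   # the missing value gets its own row
shares = grades.value_counts(normalize=True)         # fractions of the non-missing values

print(counts, counts_with_na, shares, sep='\n\n')
```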

# Dealing with Null Values

Checking the null values in the DataFrame

In [None]:
df.isna( ) # This function returns a dataframe filled with boolean values with true indicating missing values.

In [None]:
df.isnull()

In [None]:
df.isnull().sum() # Return the number of missing values in each column.

In [None]:
df.shape

In [None]:
df.shape[0]

In [None]:
df.isnull().sum() *100/df.shape[0] # percentage of missing values in each column

Handling the Null Values
There are several ways to deal with NaN or null values, and the right choice depends on the project and the data:

    Drop all null values
    Fill null values

In [None]:
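s=pd.Series(dtype='float64')   # an empty Series; the explicit dtype avoids a warning in older pandas versions
s

In [None]:
df['s']=s   # no index labels match, so the new column 's' is entirely NaN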

In [None]:
x=df.dropna(axis=0) # remove a row from a dataframe that has a NaN or missing values in it.
x

In [None]:
x=df.dropna(axis=1) # remove a column from a dataframe that has a NaN or missing values in it.
x

In [None]:
x=df.dropna(how='any')        # ‘any’ : If any NA values are present, drop that row or column.
    
x

In [None]:
df

In [None]:
dfna=pd.read_csv('../data/na.csv')
dfna

In [None]:
x=df.dropna(how='all') # ‘all’ : If all values are NA, drop that row or column.
x

In [None]:
df

In [None]:
x=df   # x is a second name for the same DataFrame, not a copy (use df.copy() for an independent copy)
x

In [None]:
x.dropna(inplace=True) # inplace=True modifies the DataFrame itself instead of returning a new one
x
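
The difference between how='any', how='all', and axis=1 shows up clearly on a small DataFrame. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, np.nan, np.nan],
    'b': [4.0, 5.0, np.nan],
})

any_dropped = df_demo.dropna(how='any')   # keeps only rows with no NaN at all
all_dropped = df_demo.dropna(how='all')   # drops only rows where every value is NaN
cols_dropped = df_demo.dropna(axis=1)     # drops every column that contains a NaN

print(any_dropped, all_dropped, cols_dropped, sep='\n\n')
```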

# Fill Null values :
Filling null values with two methods

i. .fillna()

ii. .interpolate()
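
The cells that follow demonstrate .fillna() only, so here is a minimal sketch of how .interpolate() differs, on a made-up Series. By default interpolate() fills gaps linearly between the neighbouring valid values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_constant = s.fillna(0)        # every gap becomes 0
filled_forward = s.ffill()           # every gap repeats the last valid value
filled_linear = s.interpolate()      # gaps lie on a straight line between 1.0 and 4.0

print(filled_constant.tolist())
print(filled_forward.tolist())
print(filled_linear.tolist())
```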

In [None]:
path='https://raw.githubusercontent.com/ubaid-shah/datasets/main/student_records.csv'
df = pd.read_csv(path)
df

In [None]:
df['OverallGrade'].fillna('A')

In [None]:
x=df.fillna('India')
x

In [None]:
df

In [None]:
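df['ProjectScore'].fillna(df['ProjectScore'].mean())   # fill missing scores with the column mean (x['ProjectScore'] now contains the string 'India', so its mean cannot be computed)

In [None]:
df['ProjectScore'].mean()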

In [None]:
df

In [None]:
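# x has no missing values left after fillna('India'), so the fill methods are shown on df
# fillna(method=...) is deprecated in recent pandas versions; use bfill() / ffill() instead
df.bfill(),df.ffill()

In [None]:
df.bfill(axis=0)    # fill each gap with the next valid value further down the column

In [None]:
df.ffill()          # fill each gap with the last valid value above it

In [None]:
df.ffill(axis=1)    # fill each gap with the value from the column to its left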

In [None]:
df['Recommend'].unique()   # to find the unique values in the particular column

In [None]:
df['OverallGrade'].unique()

In [None]:
df['Recommend']=df['Recommend'].replace("NO","No")    # to replace the value in the particular column

In [None]:
df['Recommend'].unique()

In [None]:
df

In [None]:
df['ResearchScore'].sum()

In [None]:
col_num = df.describe().columns
x = df[col_num]
x

In [None]:
x.sum()

In [None]:
x.sum(axis=0)

In [None]:
x.sum(axis=1)

In [None]:
x.mean(axis=0)

In [None]:
x.mean(axis=1)

In [None]:
x.count(axis=0)

In [None]:
x.count(axis=1)

In [None]:
x.min(axis=0)

In [None]:
x.min(axis=1)

In [None]:
x.max(axis=0)

In [None]:
x.max(axis=1)

In [None]:
x.median(axis=0)

In [None]:
x.median(axis=1)

In [None]:
df.mode(axis=0)

In [None]:
x.std(axis=0)

In [None]:
x.std(axis=1)

In [None]:
x.var(axis=0)

In [None]:
x.var(axis=1)

In [None]:
x.cov()

In [None]:
x.corr()

In [None]:
# Pandas dataframe.cumsum() is used to find the cumulative sum value over any axis. 
# Each cell is populated with the cumulative sum of the values seen so far.
print(x)
print()
print(x.cumsum())
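
The axis argument used throughout the cells above works the same way for cumsum(). A minimal sketch on made-up numbers: axis=0 runs down each column, axis=1 runs across each row.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

down_columns = df_demo.cumsum()          # axis=0 (default): running total down each column
across_rows = df_demo.cumsum(axis=1)     # axis=1: running total from left to right in each row
row_totals = df_demo.sum(axis=1)         # one total per row

print(down_columns, across_rows, row_totals, sep='\n\n')
```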

In [None]:
df
df=df.sort_values(by='ResearchScore')
df

In [None]:
df=df.sort_values(by='ResearchScore',ascending=False)
df

In [157]:
df=df.sort_values(by=['OverallGrade','ResearchScore'],ascending =False)
df

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
2,David,F,N,10,17.0,No
4,Marvin,E,N,20,30.0,No
9,Chris,D,U,25,15.0,No
1,John,C,N,85,51.0,Yes
7,Trent,C,Y,75,33.0,No
3,Holmes,B,Y,75,71.0,No
6,Robert,B,Y,60,59.0,No
5,Simon,A,Y,92,79.0,Yes
0,Henry,A,Y,90,85.0,Yes
8,Judy,,Y,25,,No


In [None]:
x

In [None]:
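x = x.reset_index()   # the old index is inserted as a new column named 'index'
x

In [None]:
x = x.drop(columns=['index'])   # remove the column added by reset_index(); a 'level_0' column appears only if 'index' already existed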
x

In [159]:
x=df.sort_values(by=['ResearchScore','ProjectScore'])
x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
2,David,F,N,10,17.0,No
4,Marvin,E,N,20,30.0,No
9,Chris,D,U,25,15.0,No
8,Judy,,Y,25,,No
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
3,Holmes,B,Y,75,71.0,No
1,John,C,N,85,51.0,Yes
0,Henry,A,Y,90,85.0,Yes
5,Simon,A,Y,92,79.0,Yes


In [160]:
x

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
2,David,F,N,10,17.0,No
4,Marvin,E,N,20,30.0,No
9,Chris,D,U,25,15.0,No
8,Judy,,Y,25,,No
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
3,Holmes,B,Y,75,71.0,No
1,John,C,N,85,51.0,Yes
0,Henry,A,Y,90,85.0,Yes
5,Simon,A,Y,92,79.0,Yes


In [161]:
x.sort_index()                  # sort the DataFrame with respect to index

Unnamed: 0,Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
0,Henry,A,Y,90,85.0,Yes
1,John,C,N,85,51.0,Yes
2,David,F,N,10,17.0,No
3,Holmes,B,Y,75,71.0,No
4,Marvin,E,N,20,30.0,No
5,Simon,A,Y,92,79.0,Yes
6,Robert,B,Y,60,59.0,No
7,Trent,C,Y,75,33.0,No
8,Judy,,Y,25,,No
9,Chris,D,U,25,15.0,No
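
sort_values also accepts one ascending flag per column, which the cells above do not show. The sketch below uses four rows taken from the output above and sorts grades ascending but scores descending within each grade; df_demo and the other variable names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['Henry', 'John', 'Trent', 'Simon'],
    'OverallGrade': ['A', 'C', 'C', 'A'],
    'ResearchScore': [90, 85, 75, 92],
})

# grade A before C, and within each grade the higher score first
mixed = df_demo.sort_values(by=['OverallGrade', 'ResearchScore'], ascending=[True, False])

# sort_index() brings the rows back into their original order
restored = mixed.sort_index()

print(mixed)
print(restored)
```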


# Drop duplicates
The pandas drop_duplicates() method removes duplicate rows from a DataFrame.

DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

Return a DataFrame with duplicate rows removed.

By default all columns are compared. Use subset to compare only certain columns, keep ('first', 'last' or False) to choose which occurrence survives, inplace=True to modify the DataFrame itself, and ignore_index=True to relabel the result 0, 1, ..., n - 1. Index values are ignored when identifying duplicates.

In [168]:
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5] 
    })
df

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


In [166]:
# By default, it removes duplicate rows based on all columns.
df.drop_duplicates()

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


In [173]:
# To remove duplicates on specific column(s), use subset.

df.drop_duplicates(subset=['brand'],keep='first')

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
2,Indomie,cup,3.5


In [170]:
# To remove duplicates and keep last occurrences, use keep.

df.drop_duplicates(subset=['brand', 'style'], keep='last')

Unnamed: 0,brand,style,rating
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
4,Indomie,pack,5.0
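
Two more options are worth a quick look: keep=False, which drops every row that has a duplicate, and ignore_index=True, which renumbers the result. A minimal sketch on the same data, rebuilt as df_demo so that it runs on its own:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5],
})

is_dup = df_demo.duplicated()                             # True for rows that repeat an earlier row
no_copies = df_demo.drop_duplicates(keep=False)           # drop every row that has a duplicate
renumbered = df_demo.drop_duplicates(ignore_index=True)   # relabel the result 0, 1, ..., n - 1

print(is_dup.tolist())
print(no_copies)
print(renumbered)
```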
