# Creating series
The constructor for series is: pandas.Series(data, index, dtype, copy)

Here,

data: data (can be lists, ndarrays, dictionaries etc.)

index: hashable labels, same length as data when data is a list or ndarray; labels need not be unique (default is a RangeIndex 0, 1, ..., n-1 where n is length of data)

dtype: data type of series values

copy: copy data (default False)
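As a minimal sketch of how these parameters fit together (the values and labels below are made up for illustration), the following builds one series with the default index and one with explicit labels and a forced dtype:

```python
import pandas as pd

# Default index: integer positions 0..n-1
s_default = pd.Series([1, 2, 3])

# Explicit labels and an explicit dtype for the values
s_labeled = pd.Series([1, 2, 3], index=['x', 'y', 'z'], dtype='float64')

print(list(s_default.index))   # [0, 1, 2]
print(s_labeled.dtype)         # float64
print(s_labeled['y'])          # 2.0
```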

# From NumPy ndarray

In [3]:
import numpy as np
import pandas as pd

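labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)   # convert the list to an ndarray

# Series from the ndarray with the default integer index
print(pd.Series(arr))
print('==================')
# Series from the ndarray with custom labels
print(pd.Series(arr,index=labels))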

0    10
1    20
2    30
dtype: int64
a    10
b    20
c    30
dtype: int64


# From Dictionary
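When the index is not specified, the dictionary keys become the index labels in the order they were inserted (pandas 0.23+ on Python 3.6+; older versions sorted the keys).

If an index is passed, the values are looked up by those labels: a label that is absent from the dictionary keys gets NaN, and a key that is not in the index is dropped. Because NaN is a float, an integer series that contains it is upcast to float64.

A dictionary can also hold values of different types; the resulting series then has dtype object.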

In [4]:
labels = ['a','b','c']

# dictionary
dic = {'b':1,'c':2,'d':3}

# Series without specified labels
print(pd.Series(dic))
print('============')

# Series with specified labels
print(pd.Series(dic, labels))

b    1
c    2
d    3
dtype: int64
a    NaN
b    1.0
c    2.0
dtype: float64
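To illustrate the point about heterogeneous values, here is a small sketch with a made-up record (the keys and values are hypothetical); pandas falls back to the generic object dtype when the values do not share a type:

```python
import pandas as pd

# Hypothetical record mixing a string, an integer and a float
record = {'name': 'Rob', 'age': 25, 'height': 1.8}
s = pd.Series(record)

print(s.dtype)          # object
print(list(s.index))    # ['name', 'age', 'height']
```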


# From Scalar
When data is a scalar, the value is repeated for every label of the index you pass, so the series has the same length as the index.

In [5]:
num = 10

# Series with index ['a','b','c']
print(pd.Series(num,index=['a','b','c']))
print('===============')

# Series with index [0,1,2,3,4]
print(pd.Series(num,index=range(5)))

a    10
b    10
c    10
dtype: int64
0    10
1    10
2    10
3    10
4    10
dtype: int64


# Accessing elements By Position
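A series is very similar to a NumPy array (positions start at 0), so data can be accessed in the same manner as for NumPy arrays, including slices of the form series[start:stop:step]. Note that ser[0] looks up the label 0; it matches the first position here only because the index is 0..9, and .iloc is the explicit way to access by position. Let us understand with an example.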

In [6]:
# series of numbers from 11 to 20
ser = pd.Series(data = range(11,21),index=range(10))

# retrieve the first element
print("First element is",ser[0])
print('==========')

#retrieve the first three elements
# ser[:3] -----> first three elements
print("First three elements are",ser[:3].values)
print('==========')

# retrieve index
print(ser.index)
print('==========')

# retrieve data
print(ser.values)
print('==========')

First element is 11
First three elements are [11 12 13]
RangeIndex(start=0, stop=10, step=1)
[11 12 13 14 15 16 17 18 19 20]
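When the index is not made of integers, .iloc keeps positional access unambiguous. A short sketch, assuming the same values 11 to 20 but with letter labels chosen only for illustration:

```python
import pandas as pd

# Same values as above, but labelled 'a'..'j' instead of 0..9
ser = pd.Series(range(11, 21), index=list('abcdefghij'))

first = ser.iloc[0]                       # value at position 0
every_other = ser.iloc[1:8:2].tolist()    # positions 1, 3, 5, 7
last = ser.iloc[-1]                       # negative positions count from the end

print(first, every_other, last)   # 11 [12, 14, 16, 18] 20
```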


# By labels
Single element access: series[index]

Multiple element access: series[[index1, index2, index3, .....]]

Remember to use [[ ]] to access multiple elements via labels

In [7]:
# series of first five multiples of 10
ser = pd.Series(data = [10,20,30,40,50], index = ['a','b','c','d','e'])

# retrieve value at index 'b'
print("Value at index 'b' is ",ser['b'])
print('==========')

# retrieve value at indexes 'a','c' and 'e'
print("Values at indexes 'a','c' and 'e' are ", ser[['a','c','e']].values)
print('==========')

#retrieve value at index 'f' (not present)
try:
    print("Value at index 'f' is",ser['f'])
except KeyError:
    print("There is no such index")

Value at index 'b' is  20
Values at indexes 'a','c' and 'e' are  [10 30 50]
There is no such index
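Two related idioms may be useful here, sketched on the same series: label slices with .loc include the end label (unlike positional slices), and .get() returns a default instead of raising KeyError.

```python
import pandas as pd

ser = pd.Series(data=[10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: both 'b' and 'd' are included
middle = ser.loc['b':'d'].tolist()

# Safe lookup of a label that may be missing
missing = ser.get('f')                 # None, no exception
fallback = ser.get('f', default=-1)    # custom default

print(middle, missing, fallback)   # [20, 30, 40] None -1
```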


# Dataframe
The constructor for pandas dataframe object is pandas.DataFrame( data, index, columns, dtype, copy).

Here,

data: various forms (ndarray, series, map, lists, dict, constants, another DataFrame)

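index: row labels (default RangeIndex 0..n-1)

columns: column names (default RangeIndex 0..n-1, used only when the data carries no column labels of its own)

dtype: data type to force on all columns (inferred when omitted)

copy: whether to copy the input data (default None since pandas 1.3: dict input is copied, a DataFrame or 2-D ndarray input is not)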
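A minimal sketch of these parameters, assuming a small 2-D ndarray and made-up row and column labels:

```python
import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

# No labels given: rows and columns are numbered from 0
df_default = pd.DataFrame(arr)

# Explicit row labels, column names and dtype
df_labeled = pd.DataFrame(arr, index=['r1', 'r2', 'r3'],
                          columns=['x', 'y'], dtype='float64')

print(list(df_default.columns))    # [0, 1]
print(df_labeled.loc['r2', 'y'])   # 3.0
```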

# creating Dataframe using list

In [9]:
#import packages
import pandas as pd
import numpy as np

# list of values (single column)
data = ['Rob','Bobby','John','Danny','Manny']

#construct dataframe with column called 'Name'
df = pd.DataFrame(data, columns = ['Name'])

#display
print(df)

    Name
0    Rob
1  Bobby
2   John
3  Danny
4  Manny


In [10]:
#list of values (two columns)
data =[['Rob',25],['Bobby',30],['John',21],['Danny',32],['Manny',23]]

#construct dataframe with columns called 'Name' and 'Age'
df = pd.DataFrame(data,columns = ['Name','Age'])

#display
print(df)

    Name  Age
0    Rob   25
1  Bobby   30
2   John   21
3  Danny   32
4  Manny   23


# From dictionary
Dictionary of ndarrays/lists: The keys of the dictionary will be the feature names and the values will be the values for that feature across the dataframe. Remember that the ndarrays/lists must have the same length.

In [11]:
#data source
data = {'Name':['Rob','Bobby','John','Danny','Manny'], 'Age':[25,30,21,32,23]}

#construct dataframe
df = pd.DataFrame(data, index = ['R','B','J','D','M'])

#display
print(df)

    Name  Age
R    Rob   25
B  Bobby   30
J   John   21
D  Danny   32
M  Manny   23


# From list of dictionaries
Here, every element of the list is a dictionary that corresponds to one row/instance. Each dictionary contains the feature names as keys and the feature values as values. We create the same dataframe as in the previous example, this time from a list of dictionaries.

In [13]:
# data source
data = [{'Name':'Rob','Age':25},{'Name':'Bobby','Age':30},
        {'Name':'John','Age':21},{'Name':'Danny','Age':32},
        {'Name':'Manny','Age':23}]

#construct dataframe
df = pd.DataFrame(data, index=['R','B','J','D','M'])

#display
print(df)

    Name  Age
R    Rob   25
B  Bobby   30
J   John   21
D  Danny   32
M  Manny   23
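Unlike a dictionary of lists, the row dictionaries do not all need the same keys. A short sketch, with a made-up second row that lacks 'Age':

```python
import pandas as pd

rows = [{'Name': 'Rob', 'Age': 25}, {'Name': 'Bobby'}]   # second row has no 'Age'
df_rows = pd.DataFrame(rows)

# The missing cell is filled with NaN, which makes the column float64
print(df_rows['Age'].isna().tolist())   # [False, True]
```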


# From series:

In [14]:
#construct the dataframe
df = pd.DataFrame({'Name':pd.Series(['Rob','Bobby','John','Danny','Manny'],index=['R','B','J','D','M']),
                    'Age':pd.Series([25,30,21,32,23],index=['R','B','J','D','M'])})
#display
df

    Name  Age
R    Rob   25
B  Bobby   30
J   John   21
D  Danny   32
M  Manny   23
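When the series do not share the same index, pandas aligns them on the union of their labels. A sketch with two deliberately mismatched, made-up series:

```python
import pandas as pd

name = pd.Series(['Rob', 'Bobby'], index=['R', 'B'])
age = pd.Series([30, 21], index=['B', 'J'])

# Rows are the union of both indexes; cells without a value become NaN
df_aligned = pd.DataFrame({'Name': name, 'Age': age})
print(df_aligned)
```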


# Column operations

# Selection
If you have a DataFrame df and want to select a single column, use df['col1'], which returns a Series; to select several columns, pass a list of names, df[['col1', 'col2', 'col3']], which returns a DataFrame.

In [15]:
df['Name']

R      Rob
B    Bobby
J     John
D    Danny
M    Manny
Name: Name, dtype: object
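The difference between single and double brackets shows up in the type of the result. A small sketch with a made-up frame (the 'City' column is hypothetical):

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Rob', 'Bobby'],
                       'Age': [25, 30],
                       'City': ['Oslo', 'Lima']})

single = people['Name']            # one column -> Series
multi = people[['Name', 'Age']]    # list of columns -> DataFrame

print(type(single).__name__, type(multi).__name__)   # Series DataFrame
```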

# Creation
syntax is: df[new_column] = df[col1] + df[col2] (You can also perform subtraction, division etc.)

eg:- df['Difference']=df['Attack']-df['Defense']
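A runnable sketch of the same idea, using a tiny stand-in frame with illustrative Attack and Defense values rather than the full Pokemon data:

```python
import pandas as pd

stats = pd.DataFrame({'Attack': [49, 62, 82], 'Defense': [49, 63, 83]})

# Element-wise arithmetic between two columns creates the new column
stats['Difference'] = stats['Attack'] - stats['Defense']

print(stats['Difference'].tolist())   # [0, -1, -1]
```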

# Deletion
Now you want to delete the column Difference that you just made. You can do it with df.drop([col1, col2, ...], axis=0/1, inplace=True). Here axis specifies whether the labels refer to rows (axis=0) or columns (axis=1), and inplace=True modifies df itself instead of returning a new dataframe.

eg:- df.drop(['Difference'],inplace=True,axis=1)
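Without inplace=True, drop leaves the original untouched and returns a new dataframe. A sketch on a small stand-in frame with illustrative values:

```python
import pandas as pd

stats = pd.DataFrame({'Attack': [49, 62], 'Defense': [49, 63], 'Difference': [0, -1]})

# Returns a new frame; 'stats' still has the column
trimmed = stats.drop(columns=['Difference'])
still_there = 'Difference' in stats.columns

# Same deletion, but modifying 'stats' itself
stats.drop(['Difference'], axis=1, inplace=True)

print(still_there, list(stats.columns))   # True ['Attack', 'Defense']
```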

# Row operations

# Selection
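You can access rows either by index label using loc or by integer position (row number) using iloc.

Syntax using loc: df.loc[label]

Syntax using iloc: df.iloc[row number]

eg:-
df.iloc[0]     # first row, by position
df.iloc[0,0]   # value in the first row and first column
df.loc[0]      # row whose index label is 0 (KeyError if no such label exists)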
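The distinction matters most when the index is not 0..n-1. A sketch using the small Name/Age frame from earlier, with its letter index:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Rob', 'Bobby', 'John'], 'Age': [25, 30, 21]},
                      index=['R', 'B', 'J'])

by_label = people.loc['B', 'Age']     # row labelled 'B', column 'Age'
by_position = people.iloc[1, 1]       # second row, second column (same cell)
first_cell = people.iloc[0, 0]        # first row, first column

print(by_label, by_position, first_cell)   # 30 30 Rob
```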

# Slicing
Use df[start:end] to slice rows by row number (not label); the end value is not inclusive. Here is how you can select rows number 2 and 3: df[2:4]

# Creation/Addition
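Use pd.concat([df, data]) where data is another DataFrame (wrap a single row, such as a dictionary, in a DataFrame first); the list can hold several dataframes. Older code uses df.append(data), but that method was deprecated in pandas 1.4 and removed in pandas 2.0.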
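Since DataFrame.append is no longer available in pandas 2.0 and later, a sketch of adding a row with pd.concat may help; the frame and the new row are made up for illustration:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Rob', 'Bobby'], 'Age': [25, 30]})

# Wrap the new row (a dictionary) in a one-row DataFrame
new_row = pd.DataFrame([{'Name': 'John', 'Age': 21}])

# ignore_index=True renumbers the rows 0..n-1 in the result
combined = pd.concat([people, new_row], ignore_index=True)

print(combined['Name'].tolist())   # ['Rob', 'Bobby', 'John']
```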

# Deletion
You can delete rows with .drop() by passing the row labels and specifying axis=0 (the default)

# Renaming columns
To rename columns from col1, col2 to newcol1, newcol2, use the function .rename(columns={col1:newcol1, col2:newcol2}, inplace=True) to permanently rename the columns.
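A short sketch of rename on a made-up frame; without inplace=True the original keeps its column names and a renamed copy is returned:

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

renamed = frame.rename(columns={'col1': 'newcol1', 'col2': 'newcol2'})

print(list(frame.columns))     # ['col1', 'col2']
print(list(renamed.columns))   # ['newcol1', 'newcol2']
```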

# Set index
To use an existing column as the index labels, call df.set_index('column', inplace=True)
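A minimal sketch on a made-up frame: the chosen column moves out of the columns and becomes the row labels, and reset_index() reverses the operation.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Rob', 'Bobby'], 'Age': [25, 30]})

indexed = people.set_index('Name')   # 'Name' becomes the index
age_of_bobby = indexed.loc['Bobby', 'Age']

restored = indexed.reset_index()     # 'Name' is a regular column again

print(list(indexed.columns), age_of_bobby)   # ['Age'] 30
```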

# value_counts()
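It gives a quick count of observations for each distinct value, sorted from most to least frequent. By default it does not count NAs (pass dropna=False to include them). It is most often used on series objects; DataFrame.value_counts(), available since pandas 1.1, counts unique rows instead.
eg:-
#Counts for different variants of Type 1 pokemons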

print(df['Type 1'].value_counts())
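The handling of missing values can be seen on a small stand-in series; the values below are illustrative and are not the real 'Type 1' column:

```python
import pandas as pd

types = pd.Series(['Grass', 'Fire', 'Grass', None, 'Water', 'Grass'], name='Type 1')

counts = types.value_counts()                    # missing value is skipped
counts_with_na = types.value_counts(dropna=False)  # missing value gets its own row

print(counts['Grass'], counts.sum(), counts_with_na.sum())   # 3 5 6
```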

# .unique()
Returns all the unique values present in the series as an array, in order of first appearance; similar to the set() function, except that the order is preserved

eg:-
#Different variants of Type 1 pokemon

print(df['Type 1'].unique())

# nunique()
The total number of unique elements in the series. It equals the length of the array returned by .unique(), except that NaN is not counted by default (pass dropna=False to count it).

eg:-
# How many different variants of Type 1 are there

type_1 = df['Type 1'].nunique()
print(type_1)
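The two methods differ when the series has missing values. A sketch on a small made-up series:

```python
import pandas as pd

types = pd.Series(['Grass', 'Fire', 'Grass', None])

n_unique_values = len(types.unique())        # the missing value is included
n_without_na = types.nunique()               # the missing value is ignored
n_with_na = types.nunique(dropna=False)      # the missing value is counted

print(n_unique_values, n_without_na, n_with_na)   # 3 2 3
```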

# Apply Functions
Functions can be applied along the axes of a DataFrame, or element by element to a Series, using the .apply() method
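A minimal sketch of what "along the axes" means, using a tiny stand-in frame with illustrative Attack and Defense values: axis=0 hands each column to the function, axis=1 hands it each row.

```python
import pandas as pd

stats = pd.DataFrame({'Attack': [49, 62, 82], 'Defense': [49, 63, 83]})

# axis=0 (default): the function receives one column at a time
col_range = stats.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives one row at a time
row_total = stats.apply(lambda row: row['Attack'] + row['Defense'], axis=1)

print(col_range.tolist())   # [33, 34]
print(row_total.tolist())   # [98, 125, 165]
```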

In [1]:
import numpy as np

# minimum value
lower = np.min(df['Attack'])

# maximum value
upper = np.max(df['Attack'])

# range
limit = upper - lower

# mean
mean = np.mean(df['Attack'])

# function: center on the mean and scale by the range
def standardize(x,x_mean,x_range):
    return (x-x_mean)/x_range

# apply to the 'Attack' column
print(df['Attack'].apply(lambda x:standardize(x,mean,limit)))

# group by
Grouping splits the rows of a dataframe into groups that share a value in one or more columns. In pandas we do it with the .groupby() method, which returns a GroupBy object. Let's understand it through an example where we group Pokemon according to Generation:

df.groupby('Generation')

# Inspecting groups
Now that we have created the groups, how do we inspect them? Just use the .groups attribute of the GroupBy object. It returns a dictionary where the keys are the categories and the values are the row labels for that category. For example, df.groupby('Generation').groups gives a dictionary with the categories of Generation as keys and row labels (names of Pokemon) as values.

df.groupby('Generation').groups

output:-

{1: Index([       'Bulbasaur',          'Ivysaur',         'Venusaur',
           'Mega Venusaur',       'Charmander',       'Charmeleon',
               'Charizard', 'Mega Charizard X', 'Mega Charizard Y',
                'Squirtle',
        ...
                'Articuno',           'Zapdos',          'Moltres',
                 'Dratini',        'Dragonair',        'Dragonite',
                  'Mewtwo',    'Mega Mewtwo X',    'Mega Mewtwo Y',
                     'Mew'],
       dtype='object', name='Name', length=166),
       
 2: Index(['Chikorita', 'Bayleef', 'Meganium', 'Cyndaquil', 'Quilava',
        'Typhlosion', 'Totodile', 'Croconaw', 'Feraligatr', 'Sentret',
        ...
        'Raikou', 'Entei', 'Suicune', 'Larvitar', 'Pupitar', 'Tyranitar',
        'Mega Tyranitar', 'Lugia', 'Ho-oh', 'Celebi'],
       dtype='object', name='Name', length=106),
       
 3: Index(['Treecko', 'Grovyle', 'Sceptile', 'Mega Sceptile', 'Torchic',
        'Combusken', 'Blaziken', 'Mega Blaziken', 'Mudkip', 'Marshtomp',
        ...
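The same structure can be reproduced on a three-row stand-in frame (a sketch that assumes the index is named 'Name', as in the output above); size() and get_group() are two further ways to inspect the groups:

```python
import pandas as pd

mini = pd.DataFrame({'Generation': [1, 1, 2]},
                    index=pd.Index(['Bulbasaur', 'Ivysaur', 'Chikorita'], name='Name'))

grouped = mini.groupby('Generation')

groups = grouped.groups            # category -> row labels
sizes = grouped.size()             # category -> number of rows
second = grouped.get_group(2)      # rows of a single group as a dataframe

print(list(groups[1]))    # ['Bulbasaur', 'Ivysaur']
print(sizes.tolist())     # [2, 1]
```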

# Using aggregate functions on groups
The next logical step after grouping them is the operation we need to perform on these groups. Let's say we want to calculate the median value of Sp. Atk for every Generation.

df.groupby('Generation')[['Sp. Atk']].median()

output:-

Generation  Sp. Atk
1           65
2           65
3           70
4           71
5           65
6           65
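A runnable sketch of the same aggregation on a small stand-in frame (the Sp. Atk values are made up); .agg() computes several statistics per group in one call:

```python
import pandas as pd

mini = pd.DataFrame({'Generation': [1, 1, 1, 2, 2],
                     'Sp. Atk': [65, 80, 100, 49, 63]})

# Median per group, kept as a one-column dataframe
medians = mini.groupby('Generation')[['Sp. Atk']].median()

# Several aggregates at once for the same column
summary = mini.groupby('Generation')['Sp. Atk'].agg(['median', 'max'])

print(medians['Sp. Atk'].tolist())   # [80.0, 56.0]
print(summary['max'].tolist())       # [100, 63]
```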

# Sorting
Okay, now we want to sort the median value of Sp. Atk for every Generation in descending order. We can do it with .sort_values(by=column, ascending=False), where column is the name of the column we want to sort by; ascending defaults to True, which sorts in ascending order. Let's see how it works out in our case

df.groupby('Generation')[['Sp. Atk']].median().sort_values(by='Sp. Atk',ascending=False)

output:-

Generation  Sp. Atk
4           71
3           70
1           65
2           65
5           65
6           65

# Creating pivot tables
pandas.pivot_table(data, values, index, columns, aggfunc) where

data: dataframe to be used for pivot operation

columns: Keys to group by on pivot table column

index: column/array to group our data by (will be displayed in the index column, or columns if you're passing in a list)

values (optional): Column to aggregate (if we do not specify this, the function aggregates all numeric columns)

aggfunc: Function(s) to be applied to every group (by default computes the mean)

eg:- pd.pivot_table(df,index='Generation',values='Attack',aggfunc='sum')

pd.pivot_table(df,index=['Legendary','Generation'],values='Attack')
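A sketch of pivot_table on a four-row stand-in frame; the Attack values and the 'Type 1' labels are illustrative only. It shows a sum per group, the default mean, and the effect of the columns argument:

```python
import pandas as pd

mini = pd.DataFrame({'Generation': [1, 1, 2, 2],
                     'Type 1': ['Grass', 'Fire', 'Grass', 'Fire'],
                     'Attack': [49, 110, 62, 82]})

# Total Attack per Generation
total = pd.pivot_table(mini, index='Generation', values='Attack', aggfunc='sum')

# Default aggfunc is the mean
average = pd.pivot_table(mini, index='Generation', values='Attack')

# One output column per distinct 'Type 1' value
by_type = pd.pivot_table(mini, index='Generation', columns='Type 1', values='Attack')

print(total['Attack'].tolist())     # [159, 144]
print(average['Attack'].tolist())   # [79.5, 72.0]
```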

# Merging DataFrame
Syntax of merge

The syntax is pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False) where:

left: dataframe

right: dataframe

on: Columns (names) to join on. Must be found in both the left and right DataFrame objects.

left_on: Columns from the left DataFrame to use as keys (can either be column names or arrays with length equal to the length of the DataFrame).

right_on: Columns from the right DataFrame to use as keys (can either be column names or arrays with length equal to the length of the DataFrame).

left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame.

right_index: Same usage as left_index for the right DataFrame.

how: One of 'left', 'right', 'outer', 'inner'. Defaults to 'inner'

sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to False
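When the key column has a different name in each dataframe, left_on and right_on replace on. A sketch with two made-up frames (the column name 'Pokemon' and all values are hypothetical):

```python
import pandas as pd

attack = pd.DataFrame({'Name': ['Bulbasaur', 'Charmander'], 'Attack': [49, 52]})
defense = pd.DataFrame({'Pokemon': ['Bulbasaur', 'Squirtle'], 'Defense': [49, 65]})

# Match attack['Name'] against defense['Pokemon']; default how='inner'
merged = pd.merge(attack, defense, left_on='Name', right_on='Pokemon')

print(list(merged.columns))       # ['Name', 'Attack', 'Pokemon', 'Defense']
print(merged['Name'].tolist())    # ['Bulbasaur']
```

Both key columns are kept in the result because their names differ.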

# Joins in dataframes
Inner Merge / Inner join: The default pandas behaviour; only keep rows where the merge on value exists in both the left and right dataframes.

Left Merge / Left outer join: Keep every row in the left dataframe; where the on value has no match in the right dataframe, fill the right-hand columns with NaN.

Right Merge / Right outer join: Keep every row in the right dataframe; where the on value has no match in the left dataframe, fill the left-hand columns with NaN.

Outer Merge / Full outer join: Return all the rows from the left dataframe and all the rows from the right dataframe, matching up rows where possible, with NaNs elsewhere.

# Inner merge
pd.merge(left=attack,right=defense,on='Name',how='inner')

# Outer merge
pd.merge(left=attack,right=defense,on='Name',how='outer')

# Left merge
pd.merge(left=attack,right=defense,on='Name',how='left')

# Right merge
pd.merge(left=attack,right=defense,on='Name',how='right')
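The four calls above assume two dataframes named attack and defense that share a 'Name' column. A runnable sketch with small stand-in frames (the values are illustrative) shows how the row counts differ:

```python
import pandas as pd

attack = pd.DataFrame({'Name': ['Bulbasaur', 'Charmander', 'Squirtle'],
                       'Attack': [49, 52, 48]})
defense = pd.DataFrame({'Name': ['Charmander', 'Squirtle', 'Pikachu'],
                        'Defense': [43, 65, 40]})

inner = pd.merge(left=attack, right=defense, on='Name', how='inner')
outer = pd.merge(left=attack, right=defense, on='Name', how='outer')
left = pd.merge(left=attack, right=defense, on='Name', how='left')
right = pd.merge(left=attack, right=defense, on='Name', how='right')

# Only Charmander and Squirtle appear in both frames
print(len(inner), len(outer), len(left), len(right))   # 2 4 3 3
```

In the left merge, Bulbasaur has no match in defense, so its Defense value is NaN.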