
In [None]:
import numpy as np
import pandas as pd

In [None]:
!gdown 1E3bwvYGf1ig32RmcYiWc0IXPN-mD_bI_

In [None]:
df = pd.read_csv('mckinsey.csv')
df['country']
df.head()

df[0] will not work: pandas looks for a column labelled 0 in the data frame, which does not exist, so it raises a KeyError.

Slicing like df[0:4] does work, because a slice inside df[...] selects rows by position. It fetches the first 4 rows.

In [None]:
# df[0]

In [None]:
df[0:4]
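
The difference between df[0] and df[0:4] can be reproduced without the CSV file. The sketch below uses a small made-up frame called toy (its columns and values are illustrative assumptions, not the mckinsey.csv data).

```python
import pandas as pd

# Hypothetical toy frame, not the mckinsey.csv data
toy = pd.DataFrame({'country': ['A', 'B', 'C', 'D', 'E'],
                    'year': [2001, 2002, 2003, 2004, 2005]})

try:
    toy[0]               # treated as a column label, and no column is named 0
    raised = False
except KeyError:
    raised = True

first_rows = toy[0:4]    # a slice selects rows by position: rows 0, 1, 2, 3
print(raised, len(first_rows))
```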

In [None]:
df.index  # implicit index

In [None]:
df.index = np.arange(10, 1714)
df

loc: explicit (label-based) indexing; the end of a slice is inclusive
iloc: implicit (position-based) indexing; the end of a slice is NOT inclusive
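
A minimal sketch of the inclusive/exclusive difference, assuming a made-up frame toy whose index starts at 10 (the frame and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with explicit labels 10..14
toy = pd.DataFrame({'value': [1, 2, 3, 4, 5]}, index=np.arange(10, 15))

by_label = toy.loc[10:12]      # labels 10, 11 and 12: the end label is included
by_position = toy.iloc[0:2]    # positions 0 and 1: the end position is excluded
print(len(by_label), len(by_position))
```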

In [None]:
# df.loc[1] # will not work because the explicit indexing starts from 10 in the
# above example
df.iloc[1] # this will work because implicit indexing always starts from 0

In [None]:
df.iloc[-1]

In [None]:
# df.loc[-1]  # will not work because the explicit indexing does not have -1

In [None]:
df.iloc[[3, 4, 5]]

In [None]:
df['continent'] == 'Asia'

In [None]:
df.loc[df['continent'] == 'Asia']
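
The two cells above work together: the comparison builds a boolean Series, and loc keeps the rows where it is True. A small sketch on a made-up frame (toy and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame, not the mckinsey.csv data
toy = pd.DataFrame({'country': ['India', 'France', 'Japan'],
                    'continent': ['Asia', 'Europe', 'Asia']})

mask = toy['continent'] == 'Asia'   # boolean Series with one value per row
asia = toy.loc[mask]                # keeps only the rows where mask is True
print(asia)
```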

In [None]:
temp_df = df.set_index('country')
temp_df
# temp_df.loc['Afghanistan']
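
Once a column becomes the index, loc can look rows up by that column's values. A minimal sketch, assuming a made-up two-row frame (toy and its values are illustrative):

```python
import pandas as pd

# Hypothetical toy frame, not the mckinsey.csv data
toy = pd.DataFrame({'country': ['India', 'Nepal'],
                    'year': [2007, 2007]})

by_country = toy.set_index('country')   # returns a new frame; toy is unchanged
row = by_country.loc['India']           # label lookup on the new index
print(row['year'])
```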

In [None]:
# reset the index so that it starts from 0 again
df.reset_index()  # returns a new frame: df itself is unchanged, and the old index is kept as a column named 'index'

To change the original df, pass inplace=True.
To discard the old index instead of keeping it as a column, pass drop=True.

In [None]:
df.reset_index(inplace=True, drop=True)
df
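
The effect of drop can be seen side by side on a small made-up frame (toy is an illustrative assumption; it is not modified because inplace is left at its default):

```python
import pandas as pd

# Hypothetical toy frame with explicit labels 10, 11, 12
toy = pd.DataFrame({'value': [1, 2, 3]}, index=[10, 11, 12])

kept = toy.reset_index()              # old labels are moved into an 'index' column
dropped = toy.reset_index(drop=True)  # old labels are discarded
print(list(kept.columns), list(dropped.columns))
```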

To create a new row we cannot use iloc: it raises an IndexError when the position does not exist.
For this, we need to use loc.

In [None]:
# df.iloc[len(df)] = ['Bharat', 2009, 234567, 'Asia', 44.44, 567.987]  # raises IndexError

In [None]:

df.loc[len(df)] = ['Bharat', 2009, 234567, 'Asia', 44.44, 567.987]  # create a new row
df

If we use the existing index in loc or iloc, then the row will be updated.

So, for adding a new row, we need to use loc.

For updating an existing row, we can use either loc or iloc.
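
The three rules above can be checked on a small made-up frame. This is a sketch only; toy and its values are assumptions, not the mckinsey.csv data.

```python
import pandas as pd

# Hypothetical toy frame with two rows (labels 0 and 1)
toy = pd.DataFrame({'country': ['A', 'B'], 'year': [2001, 2002]})

try:
    toy.iloc[len(toy)] = ['C', 2003]   # position 2 does not exist
    iloc_failed = False
except IndexError:
    iloc_failed = True

toy.loc[len(toy)] = ['C', 2003]   # label 2 does not exist, so a new row is added
toy.loc[0] = ['Z', 1999]          # label 0 exists, so the row is updated
toy.iloc[1] = ['Y', 1998]         # position 1 exists, so the row is updated
print(toy)
```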

In [None]:
# Add duplicate records
df.loc[len(df)] = ['India', 2009, 234567, 'Asia', 44.44, 567.987]
df.loc[len(df)] = ['India', 2009, 234567, 'Asia', 44.44, 567.987]
df.loc[len(df)] = ['India', 2009, 234567, 'Asia', 44.44, 567.987]
df.loc[len(df)] = ['India', 2009, 234567, 'Asia', 44.44, 567.987]
df.loc[len(df)] = ['India', 2009, 234567, 'Asia', 44.44, 567.987]
df

In [None]:
# Remove duplicate rows by index label (labels 1706 to 1709 hold the repeated 'India' rows)
# df.drop([1706, 1707, 1708, 1709])
df

In [None]:
# Finding the duplicates
df.duplicated()

In [None]:
df[df.duplicated()] # search and print duplicates

In [None]:
# Drop duplicates (Search + delete)
df.drop_duplicates()

In [None]:
df.drop_duplicates(keep='last')

In [None]:
df.drop_duplicates(keep='first', inplace=True)

In [None]:
df
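
The keep argument decides which copy of a duplicated row survives. A minimal sketch on a made-up frame (toy and its values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical
toy = pd.DataFrame({'country': ['India', 'India', 'Nepal'],
                    'year': [2009, 2009, 2009]})

first = toy.drop_duplicates(keep='first')   # keeps row 0, drops row 1
last = toy.drop_duplicates(keep='last')     # keeps row 1, drops row 0
print(list(first.index), list(last.index))
```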

By default, duplicated() and drop_duplicates() compare all columns when looking for duplicates.

With the subset argument, only the columns listed in subset are compared.

In the example below, only columns A and B are checked, and two rows count as duplicates when their (A, B) pairs match.

df2.duplicated(subset=['A', 'B']) returns a boolean Series.

Summing it counts each True as 1 and gives 2, because the rows at index 4 and 6 repeat the (A, B) pairs of the earlier rows at index 2 and 0.


In [None]:
data = {'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
        'C': ['small', 'large', 'large', 'small', 'small', 'large', 'large', 'small'],
        'D': [1, 2, 2, 3, 3, 4, 5, 6]}

df2 = pd.DataFrame(data)

print(sum(df2.duplicated(subset=['A', 'B'])))

In [None]:
df2.duplicated(subset=['A', 'B'])

In [None]:
df = pd.read_csv('mckinsey.csv')

Slicing on DF

In [None]:
# With the default index, the implicit and explicit indexes are the same
df.loc[1]

In [None]:
df.iloc[1]

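When we use `loc`, slices are interpreted as labels. An integer slice works on the rows here because the row labels are integers, but it does not work on the columns, whose labels are strings, so the cell below raises a TypeError.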


In [None]:
# df.loc[0:2, 1:3]

When we use iloc, integer slices work on both rows and columns, because both are addressed by position.

In [None]:
df.iloc[0:2, 1:3]

For columns with loc, we need to slice explicitly by column name.

In [None]:
df.loc[0:2, 'year':'life_exp'] # columns from year through life_exp, both included

In [None]:
df.loc[0:2, ['year','life_exp']] # only columns year and life_exp

In [None]:
df.iloc[0:5:2, 1:4]

In [None]:
df.loc[0:2, 'year':'gdp_cap': 2]
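
The column slices above can be compared on a small made-up frame. The column names below mirror the ones used in this notebook, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same column order as the notebook's data
toy = pd.DataFrame({'country': ['A', 'B', 'C', 'D'],
                    'year': [2001, 2002, 2003, 2004],
                    'population': [10, 20, 30, 40],
                    'continent': ['Asia', 'Asia', 'Europe', 'Europe'],
                    'life_exp': [60.0, 61.0, 62.0, 63.0],
                    'gdp_cap': [1.0, 2.0, 3.0, 4.0]})

label_slice = toy.loc[0:2, 'year':'life_exp']   # rows 0..2 and both end columns included
stepped = toy.loc[0:2, 'year':'gdp_cap':2]      # every second column from year to gdp_cap
pos_slice = toy.iloc[0:2, 1:3]                  # rows 0..1, column positions 1..2
print(list(label_slice.columns), list(stepped.columns), list(pos_slice.columns))
```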

In [None]:
print(df['life_exp'].mean())

In [None]:
df['life_exp'].min()

In [None]:
np.min(df['life_exp']) # a Series is one-dimensional, so NumPy functions work on it too

In [None]:
df.sort_values('life_exp', ascending=False)

In [None]:
df.sort_values(['year', 'life_exp'], ascending=[False, True])

In [None]:
# inplace=True on a chained result returns None and leaves df unchanged, so return the new frame instead
df.sort_values(['year', 'life_exp'], ascending=[False, True]).reset_index(drop=True)
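
Calling reset_index with inplace=True at the end of a chain changes only the temporary sorted frame and returns None. The sketch below shows this on a made-up frame (toy, result and sorted_toy are illustrative names), along with one way to keep the sorted result.

```python
import pandas as pd

# Hypothetical toy frame, not the mckinsey.csv data
toy = pd.DataFrame({'year': [2001, 2002, 2002],
                    'life_exp': [70.0, 65.0, 60.0]})

# inplace=True acts on the temporary frame returned by sort_values
result = toy.sort_values(['year', 'life_exp'],
                         ascending=[False, True]).reset_index(drop=True, inplace=True)

# Assigning the chained result keeps the sorted, renumbered frame
sorted_toy = toy.sort_values(['year', 'life_exp'],
                             ascending=[False, True]).reset_index(drop=True)
print(result, list(sorted_toy['life_exp']))
```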

In [None]:
df.to_csv('output.csv', index=False)

Concatenation

In [None]:
users = pd.DataFrame(
    {
        'user_id': [101, 102, 103],
        "names": ["Samskriti", "Sabhyata", "Aarush"]
    }
)
users

In [None]:
messages = pd.DataFrame(
    {
        'user_id': [101, 102, 103, 104],
        "msgs": ["Namaste", "Kaise Ho", "Mein Accha Hoon", "Kaha Ho"]
    }
)
messages

In [None]:
pd.concat([users, messages])
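
By default pd.concat stacks the frames row-wise, keeps each frame's own index labels, and fills columns missing from one frame with NaN. The sketch below repeats the two frames so it runs on its own. Passing ignore_index=True is shown as one option for renumbering the rows, and the names stacked and renumbered are illustrative.

```python
import pandas as pd

users = pd.DataFrame({'user_id': [101, 102, 103],
                      'names': ['Samskriti', 'Sabhyata', 'Aarush']})
messages = pd.DataFrame({'user_id': [101, 102, 103, 104],
                         'msgs': ['Namaste', 'Kaise Ho', 'Mein Accha Hoon', 'Kaha Ho']})

stacked = pd.concat([users, messages])                        # index labels repeat: 0, 1, 2, 0, 1, 2, 3
renumbered = pd.concat([users, messages], ignore_index=True)  # fresh index from 0 to 6
print(stacked)
```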