# Part 1
## Introduction to Data Frame in pandas
## Create Dataframe 
## Change Layout 
## Rename
## Sort
## Drop

# Introduction to Data Frames in Pandas

- pandas is built for tabular (labeled) data
- Data can be a single column or single row (1D), or rows and columns together (2D)
    - One-dimensional data (1D) is represented as a Series
    - Two-dimensional data (2D) is represented as a DataFrame
- pandas has no dedicated container for more than two dimensions; such data has to be flattened into 2D first (for example with a hierarchical index)
- pandas can load data from CSV, text, Excel, JSON and other file formats
- pandas supports integers, floats, strings, Python objects and other data types
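
As a minimal sketch of the Series / DataFrame distinction described above (the names `s` and `frame` are illustrative only, not part of the notebook):

```python
import pandas as pd

# A single column of values is a Series (1D)
s = pd.Series([10, 20, 30], name="values")

# Rows and columns together form a DataFrame (2D)
frame = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

print(s.ndim, frame.ndim)   # number of dimensions of each object
print(type(frame["x"]))     # selecting one column gives back a Series
```

Selecting a single column from a DataFrame returns a Series, which is why the two structures are usually introduced together.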

# import pandas as pd

In [23]:
import pandas as pd

# 1. Create Dataframe

- Using a dictionary that maps each column name to a list of values

- DataFrame function

Important parameters:
DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

In [30]:
df=pd.DataFrame(
    {
        "col1":[1,2,3],
        "col2":[100,200,300],
        "col3":[1000,2000,3000],
    },
    index=[1,2,3] # row labels
    )

In [31]:
df.head()

Unnamed: 0,col1,col2,col3
1,1,100,1000
2,2,200,2000
3,3,300,3000


- Create dataframe using list of lists

In [32]:
df=pd.DataFrame(
    [[1,100,1000],
    [2,200,2000],
    [3,300,3000]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [33]:
df.head()

Unnamed: 0,a,b,c
1,1,100,1000
2,2,200,2000
3,3,300,3000


- Create dataframe by loading data from a CSV file

    - read_csv function

    Important parameters:
    filepath_or_buffer, sep, header, names, index_col,
    skipinitialspace, skiprows, skipfooter, nrows, na_values,
    parse_dates, date_format, compression
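
A few of these parameters can be tried without a file on disk, because `filepath_or_buffer` also accepts a file-like object. The sketch below uses made-up, in-memory CSV text (the column names and the `"missing"` marker are assumptions for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV text, used so the example runs without a file on disk
csv_text = """id;name;score
1;alpha;10
2;beta;missing
3;gamma;30
4;delta;40
"""

df_csv = pd.read_csv(
    io.StringIO(csv_text),   # a file-like object instead of a path
    sep=";",                 # column separator used in the text above
    index_col="id",          # use the 'id' column as row labels
    na_values=["missing"],   # extra strings to treat as missing values
    nrows=3,                 # read only the first 3 data rows
)
print(df_csv)
```

Only the first three data rows are read, and the `"missing"` entry is loaded as NaN.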

In [5]:
df=pd.read_csv("./penguine/penguins_size.csv")

In [6]:
df.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


# Change Layout of Dataframe

- melt
- concat

### melt function 

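melt gathers columns into rows, turning a wide table into a long one.

Each cell of the original dataframe becomes its own row: the column name goes into the 'variable' column and the cell's value into the 'value' column, as shown in the following example. The original row labels are discarded by default (ignore_index=True).

Important parameters: melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)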


In [36]:
df=pd.DataFrame(
    [[1,100,1000],
    [2,200,2000],
    [3,300,3000]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [37]:
df

Unnamed: 0,a,b,c
1,1,100,1000
2,2,200,2000
3,3,300,3000


In [38]:
df.shape

(3, 3)

In [39]:
df_melt = pd.melt(df)
df_melt.shape

(9, 2)

In [17]:
df_melt.head(9)

Unnamed: 0,variable,value
0,a,1
1,a,2
2,a,3
3,b,100
4,b,200
5,b,300
6,c,1000
7,c,2000
8,c,3000
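
The other melt parameters control which columns are gathered and what the result columns are called. A small sketch on the same data (the names `column_name` and `cell_value` are chosen here only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 100, 1000], [2, 200, 2000], [3, 300, 3000]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# Keep 'a' as an identifier column and gather only 'b' and 'c' into rows
df_long = pd.melt(
    df,
    id_vars=["a"],
    value_vars=["b", "c"],
    var_name="column_name",    # column that holds the old column names
    value_name="cell_value",   # column that holds the gathered values
)
print(df_long)
```

With `id_vars` set, each identifier value is repeated once per gathered column, so three rows and two value columns give six rows.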


### concat function

Important parameters:
concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)

- Append rows to a dataframe

Dataframes with the same column names can be concatenated row-wise to form a new dataframe. The original row labels are kept, so labels can repeat in the result.

In [47]:
df1=pd.DataFrame(
    [[1,100,1000],
    [2,200,2000],
    [3,300,3000]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [48]:
df1.head()

Unnamed: 0,a,b,c
1,1,100,1000
2,2,200,2000
3,3,300,3000


In [56]:
df2=pd.DataFrame(
    [[4,400,4000],
    [5,500,5000],
    [6,600,6000]],
    index=[1,2,3],
    columns=["a","b","c"]
    )
df2.head()

Unnamed: 0,a,b,c
1,4,400,4000
2,5,500,5000
3,6,600,6000


In [57]:
pd.concat([df1,df2])

Unnamed: 0,a,b,c
1,1,100,1000
2,2,200,2000
3,3,300,3000
1,4,400,4000
2,5,500,5000
3,6,600,6000
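
The row labels 1, 2, 3 appear twice in the result above. If that is not wanted, one option is `ignore_index=True`, sketched here on the same two dataframes (the name `stacked` is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame(
    [[1, 100, 1000], [2, 200, 2000], [3, 300, 3000]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)
df2 = pd.DataFrame(
    [[4, 400, 4000], [5, 500, 5000], [6, 600, 6000]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# Discard the original row labels and number the rows 0..n-1 instead
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked.index.tolist())
```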


- Append columns to a dataframe

Dataframes with the same row labels can be concatenated column-wise (axis=1) to form a new dataframe. Rows are aligned on their labels, not on their position.

In [58]:
df1=pd.DataFrame(
    [[1,100,1000],
    [2,200,2000],
    [3,300,3000]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [59]:
df2=pd.DataFrame(
    [[4,400,4000],
    [5,500,5000],
    [6,600,6000]],
    index=[1,2,3],
    columns=["d","e","f"]
    )

In [60]:
pd.concat([df1,df2],axis=1)

Unnamed: 0,a,b,c,d,e,f
1,1,100,1000,4,400,4000
2,2,200,2000,5,500,5000
3,3,300,3000,6,600,6000
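
When the row labels only partly overlap, the `join` parameter decides which rows survive. The following sketch uses two small made-up dataframes (`left` and `right` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3], "b": [100, 200, 300]}, index=[1, 2, 3])
right = pd.DataFrame({"c": [5, 6, 7]}, index=[2, 3, 4])

# join='outer' (the default) keeps every row label and fills gaps with NaN
outer = pd.concat([left, right], axis=1)

# join='inner' keeps only the row labels present in both dataframes
inner = pd.concat([left, right], axis=1, join="inner")

print(outer)
print(inner)
```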


# Rename

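- Used to rename one or more column labels (row labels can be renamed the same way through the index parameter)

- Returns a changed copy of the dataframe (when inplace=False, the default)

- To modify the original dataframe, set inplace=True

- The mapping (function or dict) must be 1-to-1: each old label maps to a unique new label

- Labels not contained in the dict / Series are left as-is

- Extra labels in the mapping that do not exist in the dataframe don’t raise an error (default errors='ignore')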

In [4]:
df=pd.DataFrame(
    [[1,100,1000],
    [2,200,2000],
    [3,300,3000]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [5]:
df.columns

Index(['a', 'b', 'c'], dtype='object')

In [7]:
df_new = df.rename(columns = {'a':'First'})

In [8]:
df_new.columns

Index(['First', 'b', 'c'], dtype='object')

In [9]:
df_new.rename(columns = {'b':'Second','c':'Third'}, inplace=True)

In [10]:
df_new.columns

Index(['First', 'Second', 'Third'], dtype='object')
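
The remaining points from the list above can be sketched as follows: a function as the mapping, renaming a row label, and the behaviour for a label that does not exist (the label `'z'` is a deliberately missing, hypothetical column):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 100, 1000], [2, 200, 2000], [3, 300, 3000]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# A function is applied to every column label; a dict renames row label 1
renamed = df.rename(columns=str.upper, index={1: "row1"})

# 'z' is not a column, so with the default errors='ignore' nothing happens
unchanged = df.rename(columns={"z": "Z"})

# With errors='raise' the same call raises a KeyError instead
try:
    df.rename(columns={"z": "Z"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(renamed.columns.tolist(), renamed.index.tolist(), raised)
```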

# Sort

- sort_index function

sort_index(*, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)

Sort object by labels (along an axis).

Returns a new DataFrame sorted by label if the inplace argument is False (the default); otherwise it updates the original DataFrame and returns None.

In [21]:
df=pd.DataFrame(
    [[1,100,1000],
    [2,200,2000],
    [3,300,3000]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [22]:
df.head()

Unnamed: 0,a,b,c
1,1,100,1000
2,2,200,2000
3,3,300,3000


In [20]:
df_new = df.sort_index(ascending=False)

In [16]:
df_new.head()

Unnamed: 0,a,b,c
3,3,300,3000
2,2,200,2000
1,1,100,1000


In [17]:
df_new = df.sort_index(ascending=True)

In [18]:
df_new.head()

Unnamed: 0,a,b,c
1,1,100,1000
2,2,200,2000
3,3,300,3000
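
Two further aspects of sort_index are sorting along the other axis and the `inplace=True` return value. A minimal sketch on made-up data (`df_cols` and `df_rows` are illustrative names):

```python
import pandas as pd

# Columns deliberately out of alphabetical order
df_cols = pd.DataFrame(
    [[1000, 1, 100], [2000, 2, 200]],
    index=[1, 2],
    columns=["c", "a", "b"],
)

# axis=1 sorts by column labels instead of row labels
by_column_label = df_cols.sort_index(axis=1)

# Row labels deliberately out of order
df_rows = pd.DataFrame({"a": [30, 10, 20]}, index=[3, 1, 2])

# inplace=True sorts df_rows itself, and the call returns None
returned = df_rows.sort_index(inplace=True)

print(by_column_label.columns.tolist(), df_rows.index.tolist(), returned)
```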


- sort_values function

sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

Sort by the values along either axis.


In [37]:
df=pd.DataFrame(
    [["abcd",55,987],
    ["sample",22,451],
    ["example",45,123]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [26]:
df.head()

Unnamed: 0,a,b,c
1,abcd,55,987
2,sample,22,451
3,example,45,123


In [27]:
df_sorted = df.sort_values(by='b')

In [28]:
df_sorted.head()

Unnamed: 0,a,b,c
2,sample,22,451
3,example,45,123
1,abcd,55,987


In [29]:
df_sorted = df.sort_values('c')

In [30]:
df_sorted.head()

Unnamed: 0,a,b,c
3,example,45,123
2,sample,22,451
1,abcd,55,987
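
sort_values also accepts a list of columns in `by`, with a matching list in `ascending`. A short sketch on made-up data (the `group` and `score` columns are assumptions for illustration):

```python
import pandas as pd

scores = pd.DataFrame(
    {"group": ["x", "y", "x", "y"], "score": [3, 1, 2, 4]}
)

# Sort by 'group' ascending, then by 'score' descending within each group
ordered = scores.sort_values(by=["group", "score"], ascending=[True, False])
print(ordered)
```

The row labels travel with their rows, so the index of the result shows where each row came from.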


- sort_values using a key function

The key function is applied in stages:

- Stage 1: each column being sorted is passed, as a whole Series, to the key function
- Stage 2: the key function computes a key for every value in the column and returns the keys as a Series of the same length
- Stage 3: the rows are ordered by these keys instead of by the original values

In the example below, column 'a' is sorted by the second letter of each string. apply_columnwise is passed as the key function and receives the complete column 'a' as its argument. Inside apply_columnwise, get_second_letter extracts the second letter from each string. The rows are then sorted by these keys (the second letter of each string).

In [44]:
df=pd.DataFrame(
    [["abcd",55,987],
    ["sample",22,451],
    ["example",45,123]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [45]:
def get_second_letter(string):
    return string[1]
def apply_columnwise(column_series):
    return column_series.apply(get_second_letter)

df_sorted = df.sort_values('a', key=apply_columnwise)

In [46]:
df_sorted.head()

Unnamed: 0,a,b,c
2,sample,22,451
1,abcd,55,987
3,example,45,123
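
Because the key function receives a whole Series, the same sort can also be written with the vectorized `.str` accessor rather than `apply`. This is an alternative sketch, not the notebook's original approach:

```python
import pandas as pd

df = pd.DataFrame(
    [["abcd", 55, 987], ["sample", 22, 451], ["example", 45, 123]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# col.str[1] takes the second character of every string in the column
df_sorted = df.sort_values("a", key=lambda col: col.str[1])
print(df_sorted)
```

The keys are 'b', 'a' and 'x' for rows 1, 2 and 3, so the resulting order is the same as with apply_columnwise.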


# Drop

- drop function

drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by directly specifying index or column names.

Row labels are the values of the index (here the numbers 1, 2, 3); they need not be numbers.

Column labels are the column names.

In [47]:
df=pd.DataFrame(
    [["abcd",55,987],
    ["sample",22,451],
    ["example",45,123]],
    index=[1,2,3],
    columns=["a","b","c"]
    )

In [48]:
# drop row with index 1, by default axis=0 means rows will be dropped
df.drop([1])

Unnamed: 0,a,b,c
2,sample,22,451
3,example,45,123


In [49]:
# drop column 'b', axis=1 means columns will be dropped
df.drop(['b'],axis=1)

Unnamed: 0,a,c
1,abcd,987
2,sample,451
3,example,123
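
Instead of `labels` plus `axis`, the `index` and `columns` keywords name the axis directly, and `errors='ignore'` skips labels that do not exist. A sketch on the same dataframe (the label `'missing'` is hypothetical and deliberately absent):

```python
import pandas as pd

df = pd.DataFrame(
    [["abcd", 55, 987], ["sample", 22, 451], ["example", 45, 123]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# Drop the row labeled 1 and the column 'b' in a single call
trimmed = df.drop(index=[1], columns=["b"])

# A label that does not exist is skipped silently with errors='ignore'
untouched = df.drop(columns=["missing"], errors="ignore")

print(trimmed)
print(untouched.shape)
```

Both calls return new dataframes; `df` itself is unchanged because inplace defaults to False.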


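- Errors in drop

    - A column cannot be dropped by its position

    drop matches labels, not positions, so passing 0 with axis=1 looks for a column named 0.
    No such column exists here, so pandas raises KeyError: '[0] not found in axis'.

In [None]:
df.drop([0],axis=1)
# ERROR: KeyError, because there is no column labeled 0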
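
One way around this, sketched below, is to look up the label at a given position in `df.columns` and pass that label to drop (the names `first_label` and `without_first` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [["abcd", 55, 987], ["sample", 22, 451], ["example", 45, 123]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# Passing the position 0 as a column label fails with a KeyError
try:
    df.drop([0], axis=1)
    key_error_raised = False
except KeyError:
    key_error_raised = True

# Translate the position into a label first, then drop by label
first_label = df.columns[0]
without_first = df.drop(columns=[first_label])

print(key_error_raised, first_label, without_first.columns.tolist())
```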