### Pandas - DataFrame and Series

Pandas is a powerful data manipulation library in Python, used for data analysis and data cleaning. It provides two primary data structures: Series and DataFrame.

A Series is a one-dimensional, labelled, array-like object.

A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous data structure with labelled axes (rows and columns).

In [1]:
import pandas as pd

Series - A 1D array-like object that can hold any data type, similar to a single column in a table. Each value is paired with an index label (0, 1, 2, ... by default).

In [4]:
data = [1, 2, 3, 4, 5]
series = pd.Series(data)

In [6]:
print(series)
print(type(series))

0    1
1    2
2    3
3    4
4    5
dtype: int64
<class 'pandas.core.series.Series'>


Create a Series from a Dictionary

In [7]:
dictionary = {'a' : 1, 'b' : 2, 'c' : 3}
series_dict = pd.Series(dictionary)

In [8]:
print(series_dict)

a    1
b    2
c    3
dtype: int64


In [9]:
# Pair the list defined earlier with custom index labels
index = ['a', 'b', 'c', 'd', 'e']
pd.Series(data, index = index)

a    1
b    2
c    3
d    4
e    5
dtype: int64
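
Once a Series has custom labels, values can be looked up either by label or by position. The following is a minimal sketch that rebuilds the same Series; the variable names are chosen only for illustration.

```python
import pandas as pd

# Same values and labels as the Series built above
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['c']          # label-based lookup
by_position = s.iloc[2]        # position-based lookup (third element)
label_slice = s.loc['b':'d']   # label slices include both endpoints
pos_slice = s.iloc[1:3]        # position slices exclude the stop index

print(by_label, by_position)
print(label_slice.tolist())
print(pos_slice.tolist())
```

Note that the label slice returns three values while the position slice returns two, because only label-based slicing includes the end point.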

DataFrame

In [10]:
dataframe = {
    'name' : ['Anna', 'Henry', 'Paul'],
    'age' : [25, 20, 45],
    'city' : ['Chennai', 'Manila', 'Singapore']
}

In [12]:
df = pd.DataFrame(dataframe)
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [13]:
print(df)

    name  age       city
0   Anna   25    Chennai
1  Henry   20     Manila
2   Paul   45  Singapore


In [14]:
import numpy as np
np.array(df)

array([['Anna', 25, 'Chennai'],
       ['Henry', 20, 'Manila'],
       ['Paul', 45, 'Singapore']], dtype=object)
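
The `dtype=object` in the output above comes from mixing text and integer columns in a single array. As a small sketch, `DataFrame.to_numpy()` can be used for the same conversion, and converting one numeric column on its own keeps an integer dtype. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Henry', 'Paul'],
    'age': [25, 20, 45],
    'city': ['Chennai', 'Manila', 'Singapore'],
})

mixed = df.to_numpy()          # text + integers -> one object array
ages = df['age'].to_numpy()    # a single numeric column keeps an integer dtype

print(mixed.dtype, mixed.shape)
print(ages.dtype, ages)
```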

Create a DataFrame from a List of Dictionaries

In [15]:
dict_list = [
    {'name' : 'Anna', 'age' : 25, 'city' : 'Chennai'},
    {'name' : 'Henry', 'age' : 20, 'city' : 'Manila'},
    {'name' : 'Paul', 'age' : 45, 'city' : 'Singapore'}
]

In [16]:
df_dict_list = pd.DataFrame(dict_list)
print(type(df_dict_list))
print(df_dict_list)

<class 'pandas.core.frame.DataFrame'>
    name  age       city
0   Anna   25    Chennai
1  Henry   20     Manila
2   Paul   45  Singapore
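
The dictionaries in the list do not have to share exactly the same keys. In the sketch below, two records are deliberately incomplete (hypothetical data, not from the notebook above), and pandas fills the gaps with `NaN`.

```python
import pandas as pd

records = [
    {'name': 'Anna', 'age': 25, 'city': 'Chennai'},
    {'name': 'Henry', 'age': 20},             # no 'city' key (hypothetical)
    {'name': 'Paul', 'city': 'Singapore'},    # no 'age' key (hypothetical)
]
df_partial = pd.DataFrame(records)

print(df_partial)
print(df_partial.isna().sum())   # number of missing values per column
```

Because `NaN` is a floating-point value, the `age` column is stored as `float64` here rather than `int64`.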


Accessing Data from a DataFrame

In [24]:
df_dict_list['name']

0     Anna
1    Henry
2     Paul
Name: name, dtype: object

In [25]:
df_dict_list.loc[1]  # select the row whose index label is 1

name     Henry
age         20
city    Manila
Name: 1, dtype: object

In [27]:
df_dict_list.iloc[0]  # select the first row by integer position

name       Anna
age          25
city    Chennai
Name: 0, dtype: object
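
With the default index (0, 1, 2, ...), `loc` and `iloc` happen to accept the same numbers, which hides the difference between them. The sketch below uses a hypothetical non-default index to make the distinction visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Anna', 'Henry', 'Paul'], 'age': [25, 20, 45]},
    index=[10, 20, 30],   # hypothetical non-default row labels
)

first_by_position = df.iloc[0]   # first row, whatever its label is
row_by_label = df.loc[20]        # the row whose label is 20

# loc also accepts a boolean mask for rows and a list of column names
subset = df.loc[df['age'] > 21, ['name', 'age']]

print(first_by_position['name'], row_by_label['name'])
print(subset)
```

Here `df.loc[0]` would raise a `KeyError`, since no row carries the label 0.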

In [None]:
# load csv data
df_csv = pd.read_csv('business.csv')


In [23]:
df_csv.tail(5)

                                            description                         industry  level   size  line_code  value
6622  Number of years business is dealing with main ...             Education & training      1  total   D2100.04    468
6623  Number of years business is dealing with main ...  Health care & social assistance      1  total   D2100.04   1320
6624  Number of years business is dealing with main ...       Arts & recreation services      1  total   D2100.04    264
6625  Number of years business is dealing with main ...                   Other services      1  total   D2100.04    936
6626  Number of years business is dealing with main ...                            total      0  total   D2100.04  24012
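
`read_csv` also accepts any file-like object, which makes it easy to try out without a file on disk. The following sketch parses a small in-memory CSV; the rows are made up for illustration and are not taken from `business.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """description,industry,level,value
Sample row A,Education,1,10
Sample row B,Health,1,20
Sample row C,total,0,30
"""

df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)     # (rows, columns)
print(df_small.head(2))   # first two rows
print(df_small.tail(1))   # last row
```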


In [28]:
df_csv.iloc[6]

description    Type of outstanding debt: bank overdrafts
industry                              Commercial fishing
level                                                  2
size                                               total
line_code                                          D0201
value                                                 24
Name: 6, dtype: object

In [34]:
df_csv.iloc[6, 0:3]  # row 6, first three columns, in a single indexing call

description    Type of outstanding debt: bank overdrafts
industry                              Commercial fishing
level                                                  2
Name: 6, dtype: object

In [35]:
df_csv.at[6, 'industry']

'Commercial fishing'
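
`at` reads or writes a single cell by label, and `iat` does the same by integer position. A minimal sketch on the small people table used earlier (rebuilt here so the example runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Henry', 'Paul'],
    'age': [25, 20, 45],
    'city': ['Chennai', 'Manila', 'Singapore'],
})

city = df.at[1, 'city']   # row label 1, column label 'city'
age = df.iat[2, 1]        # third row, second column, by position

df.at[1, 'age'] = 21      # at can also assign a single value

print(city, age, df.at[1, 'age'])
```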

Data Manipulation with DataFrame

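In [39]:
# Add a new column; the name is capitalized here, as the output below shows
df['Salary'] = [70000, 73000, 100000]
df

    name  age       city  Salary
0   Anna   25    Chennai   70000
1  Henry   20     Manila   73000
2   Paul   45  Singapore  100000

In [None]:
# drop() treats its argument as a row label by default (axis=0),
# so passing a column name without axis=1 raises an error:
# df.drop('Salary')
# KeyError: "['Salary'] not found in axis"

In [40]:
# Add the same values under a lowercase column name
df['salary'] = [70000, 73000, 100000]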


In [41]:
# drop() returns a new DataFrame; df itself still has both columns
df.drop('Salary', axis = 1)

    name  age       city  salary
0   Anna   25    Chennai   70000
1  Henry   20     Manila   73000
2   Paul   45  Singapore  100000


In [42]:
df

    name  age       city  Salary  salary
0   Anna   25    Chennai   70000   70000
1  Henry   20     Manila   73000   73000
2   Paul   45  Singapore  100000  100000


In [43]:
# inplace=True modifies df directly and returns None
df.drop('Salary', axis = 1, inplace = True)

In [44]:
df

    name  age       city  salary
0   Anna   25    Chennai   70000
1  Henry   20     Manila   73000
2   Paul   45  Singapore  100000
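
An alternative to `axis = 1` is the `columns=` keyword, which states the intent directly. The sketch below also shows `errors='ignore'` for labels that may not exist, and reassignment as a substitute for `inplace = True`. The two-column frame is a cut-down stand-in for the table above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Henry', 'Paul'],
    'salary': [70000, 73000, 100000],
})

without_salary = df.drop(columns='salary')   # returns a new DataFrame
still_there = 'salary' in df.columns         # the original is unchanged

# A missing label raises KeyError unless errors='ignore' is passed
unchanged = df.drop(columns='Salary', errors='ignore')

df = df.drop(columns='salary')               # reassign instead of inplace=True

print(list(without_salary.columns), still_there)
print(list(unchanged.columns), list(df.columns))
```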


Add 1 to every value in the age column

In [45]:
df['age'] = df['age'] + 1

In [46]:
df

    name  age       city  salary
0   Anna   26    Chennai   70000
1  Henry   21     Manila   73000
2   Paul   46  Singapore  100000
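
Column arithmetic like this is applied to every row at once, so no explicit loop is needed. As a sketch, the same idea can create a derived column or, with `assign()`, produce a modified copy. The `monthly_salary` column is a hypothetical name introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Henry', 'Paul'],
    'age': [26, 21, 46],
    'salary': [70000, 73000, 100000],
})

# Integer division is applied to each row of the column
df['monthly_salary'] = df['salary'] // 12   # hypothetical derived column

# assign() returns a modified copy and leaves df untouched
older = df.assign(age=df['age'] + 1)

print(df['monthly_salary'].tolist())
print(older['age'].tolist(), df['age'].tolist())
```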


In [47]:
df_csv.dtypes

description    object
industry       object
level           int64
size           object
line_code      object
value           int64
dtype: object
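
Column types matter because they determine which operations are available. The sketch below assumes a hypothetical frame in which `value` was read as text, converts it with `astype`, and then picks out the numeric columns with `select_dtypes`.

```python
import pandas as pd

df = pd.DataFrame({
    'industry': ['Education', 'Health'],
    'level': [1, 0],
    'value': ['468', '1320'],   # numbers stored as text (hypothetical)
})

df['value'] = df['value'].astype('int64')      # text -> integers
numeric = df.select_dtypes(include='number')   # keep only numeric columns

print(df.dtypes)
print(list(numeric.columns), df['value'].sum())
```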

In [56]:
# Summary statistics for the numeric columns
df_csv.describe()

          level        value
count    6627.0       6627.0
mean   1.404255   421.937981
std    0.673548  2071.697633
min         0.0          0.0
25%         1.0          3.0
50%         2.0         27.0
75%         2.0        153.0
max         2.0      42840.0
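
Only `level` and `value` appear above because `describe()` summarizes numeric columns by default. A small sketch on made-up data, with values chosen so the statistics are easy to check by hand:

```python
import pandas as pd

# Hypothetical data: one text column and one numeric column
df = pd.DataFrame({
    'industry': ['A', 'B', 'C', 'D'],
    'value': [5, 10, 15, 30],
})

stats = df.describe()   # the text column is left out of the summary

print(stats)
print(stats.loc['mean', 'value'], stats.loc['50%', 'value'])
```

The `50%` row is the median, and the `std` row uses the sample standard deviation, the same result as `df['value'].std()`.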
