# Introduction to the Pandas library

Pandas is an open-source, BSD-licensed library that is widely used for exploratory data analysis (EDA).

It provides easy-to-use data structures and data analysis tools for the Python programming language.

> NumPy and Pandas are a must for EDA

In [2]:
import pandas as pd

# importing numpy
import numpy as np

### Terminology 

#### Dataframe 
Once a data set is loaded (for example, from an Excel file), pandas stores it as a DataFrame. A DataFrame is made up of rows and columns and looks much like the same data in a spreadsheet.

**A DataFrame is a 2D, table-like structure with labeled rows and columns**

##### Playing with dataframe

The data is created with the `numpy.arange()` function and then reshaped with `reshape()` so that it becomes a 2D array.

We can create a simple DataFrame using `pd.DataFrame(data, index=[row labels], columns=[column labels])`.

In [5]:
# index - row names
# column - col names

df = pd.DataFrame(np.arange(0,20).reshape(5,4), index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'], columns=['Col1','Col2','Col3','Col4']) 

df

      Col1  Col2  Col3  Col4
Row1     0     1     2     3
Row2     4     5     6     7
Row3     8     9    10    11
Row4    12    13    14    15
Row5    16    17    18    19
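As a quick check on the frame built above, the following minimal sketch rebuilds the same DataFrame and inspects its shape and labels. It only uses the `shape`, `index`, and `columns` attributes of a DataFrame.

```python
import numpy as np
import pandas as pd

# Rebuild the same 5x4 DataFrame used in this section
df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

print(df.shape)          # (rows, columns)
print(list(df.index))    # row labels
print(list(df.columns))  # column labels
```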


#### head() function 

Returns the first five rows of the DataFrame by default. Pass a number, as in `df.head(3)`, to get a different count.

In [6]:
df.head() # prints the first five rows of dataframe

      Col1  Col2  Col3  Col4
Row1     0     1     2     3
Row2     4     5     6     7
Row3     8     9    10    11
Row4    12    13    14    15
Row5    16    17    18    19
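Because the example frame has exactly five rows, `head()` returns all of it. A small sketch with an explicit row count makes the behavior easier to see; `tail()` is the matching function for the last rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

first_two = df.head(2)  # first two rows: Row1, Row2
last_two = df.tail(2)   # last two rows: Row4, Row5

print(first_two)
print(last_two)
```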


#### df.to_csv(fileName)

Writes the DataFrame to a CSV file. A bare file name such as `file.csv` is saved in the current working directory, and the row labels are written as the first column.

In [9]:
df.to_csv("file.csv") # file will get stored in current directory
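To read the file back with the row labels restored, one option is `pd.read_csv` with `index_col=0`. The sketch below writes to a temporary directory (an assumption made here so the example cleans up after itself) and compares the two ways of reading.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

# Write to a temporary directory instead of the current one
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.csv")
    df.to_csv(path)

    # index_col=0 turns the first CSV column back into row labels
    restored = pd.read_csv(path, index_col=0)

    # Without index_col, the row labels become an ordinary column
    default_read = pd.read_csv(path)

print(restored)
print(default_read.columns[0])  # the unnamed label column
```

Without `index_col=0`, the saved row labels show up as a regular column named `Unnamed: 0`, which is a common surprise when reloading a file written by `to_csv`.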

### Accessing the elements using .loc and .iloc

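`.loc` - label-based location (selects by row and column names)

`.iloc` - integer-based location (selects by row and column position)

> DataFrame - a combination of rows and columns. It usually holds several of each, but a selection with a single row and a single column can still be a DataFrame

> Series - any single column or single row selected from a DataFrame is a Series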

In [12]:
df.loc['Row1'] 

Col1    0
Col2    1
Col3    2
Col4    3
Name: Row1, dtype: int64

In [14]:
type(df.loc['Row1'])  # a Series, since we are accessing a single row (the same holds for a single column)

pandas.core.series.Series
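`.loc` also accepts a column label, a list of labels, or a label slice. The sketch below shows a few of these forms on the same frame; note that label slices include both endpoints, unlike integer slices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

# Single cell: row label and column label
cell = df.loc['Row2', 'Col3']

# Label slice for rows (Row1 through Row3, inclusive) and a list of columns
sub = df.loc['Row1':'Row3', ['Col1', 'Col4']]

print(cell)
print(sub)
```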

#### Index based - iloc

Similar to accessing elements by integer position in NumPy

In [15]:
df

      Col1  Col2  Col3  Col4
Row1     0     1     2     3
Row2     4     5     6     7
Row3     8     9    10    11
Row4    12    13    14    15
Row5    16    17    18    19


In [16]:
df.iloc[0:,0:] # prints the whole dataframe

      Col1  Col2  Col3  Col4
Row1     0     1     2     3
Row2     4     5     6     7
Row3     8     9    10    11
Row4    12    13    14    15
Row5    16    17    18    19


In [18]:
print(df.iloc[0:1, 0:1])

print(type(df.iloc[0:1,0:1])) # DataFrame, since both the row and column selectors are slices

      Col1
Row1     0
<class 'pandas.core.frame.DataFrame'>


In [20]:
print(df.iloc[0:1, 0])

print(type(df.iloc[0:1,0])) # Series, since the column selector is a single integer

Row1    0
Name: Col1, dtype: int64
<class 'pandas.core.series.Series'>
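The return type of `.iloc` depends on the kind of selector, not on how many rows or columns happen to match. A minimal sketch of the three cases on the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

as_frame = df.iloc[0:1, 0:1]  # slice, slice -> DataFrame (here 1 row x 1 column)
as_series = df.iloc[0:1, 0]   # slice, integer -> Series
as_scalar = df.iloc[0, 0]     # integer, integer -> single value

print(type(as_frame))
print(type(as_series))
print(as_scalar)
```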


#### Converting dataframe into arrays 

We can use the `.values` attribute, which returns the data as a NumPy array and drops the row and column names. The pandas documentation recommends `.to_numpy()` for the same purpose.

In [22]:
df.iloc[:,:].values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
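The same conversion can be sketched with `DataFrame.to_numpy()`. The result is a plain NumPy array, so it is indexed by position only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

arr = df.to_numpy()  # row and column labels are not carried over

print(type(arr))
print(arr.shape)
print(arr[1, 2])  # second row, third column, by position
```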

##### Using `.isnull()`

`.isnull()` marks each cell as `True` if it is null (missing) and `False` otherwise. Chaining `.sum()` counts the null values in each column.

In [24]:
df.isnull().sum()

Col1    0
Col2    0
Col3    0
Col4    0
dtype: int64
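The example frame has no missing data, so every count is zero. The sketch below uses a hypothetical copy, `df_missing`, with a few cells set to `NaN` to show non-zero counts; the copy and the chosen cells are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

# Hypothetical copy with missing values (floats, so NaN can be stored)
df_missing = df.astype(float)
df_missing.loc['Row2', 'Col1'] = np.nan
df_missing.loc['Row4', 'Col1'] = np.nan
df_missing.loc['Row3', 'Col3'] = np.nan

null_counts = df_missing.isnull().sum()  # nulls per column
total_nulls = int(null_counts.sum())     # nulls in the whole frame

print(null_counts)
print(total_nulls)
```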

#### value_counts() function

Returns the number of occurrences of each distinct value, sorted from most to least frequent. It is typically called on a single column (a Series).

In [26]:
df.head()

      Col1  Col2  Col3  Col4
Row1     0     1     2     3
Row2     4     5     6     7
Row3     8     9    10    11
Row4    12    13    14    15
Row5    16    17    18    19


In [28]:
df['Col1'].value_counts() 

Col1
0     1
4     1
8     1
12    1
16    1
Name: count, dtype: int64
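Every value in `Col1` is distinct, so each count is 1. A sketch with repeated values shows the counting more clearly; the `colors` Series is a made-up example, not data from this notebook.

```python
import pandas as pd

# Hypothetical column with repeated values
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

counts = colors.value_counts()  # sorted by count, highest first

print(counts)
```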

#### unique() function

Returns the unique values in a particular column, in order of first appearance

In [30]:
df['Col1'].unique()

array([ 0,  4,  8, 12, 16])
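When a column has duplicates, `unique()` keeps one copy of each value and `nunique()` returns how many distinct values there are. A small sketch on a made-up Series `s`:

```python
import pandas as pd

# Hypothetical column with duplicates
s = pd.Series([3, 1, 3, 2, 1])

distinct = s.unique()        # distinct values, in order of first appearance
n_distinct = s.nunique()     # number of distinct values

print(distinct)
print(n_distinct)
```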

#### Printing columns without using .loc or .iloc

In [31]:
df['Col1'] 

Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Name: Col1, dtype: int64

In [32]:
df[['Col1','Col2']]

      Col1  Col2
Row1     0     1
Row2     4     5
Row3     8     9
Row4    12    13
Row5    16    17
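The bracket style decides the return type: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one entry. A minimal sketch on the same frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(0, 20).reshape(5, 4),
    index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'],
    columns=['Col1', 'Col2', 'Col3', 'Col4'],
)

one_col = df['Col1']          # single name -> Series
one_col_frame = df[['Col1']]  # list with one name -> DataFrame

print(type(one_col))
print(type(one_col_frame))
print(one_col_frame.shape)
```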
