# Pandas

***

### Why Pandas

Pandas is one of the most powerful data manipulation tools available, and a data scientist who knows how to use its indexing well gets even more out of it.

### DataFrame Basics

The *DataFrame* is the main object in Pandas. Pandas takes data (such as a CSV or JSON file, or a SQL database) and creates a Python object with **rows** and **columns**. A DataFrame is used to represent tabular data, much like an Excel spreadsheet.



In [None]:
from IPython.display import Image 
Image("EDA.png")

***

## 6 Parts of Pandas
1. Importing Data and Reading Data
2. Summarizing Data (Statistics)  
3. Manipulating Data / Cleaning Data
4. Selecting Data / Subsetting Data
5. Grouping and Filtering Data
6. Combining Datasets

# Getting Started

## Import Libraries


**Pandas:** Used for data manipulation and data analysis.
<br>
**NumPy:** The fundamental package for scientific computing with Python.
<br>
**Matplotlib and Seaborn:** Used for plotting and visualization.
<br>
**Scikit-learn:** Used for data preprocessing techniques and algorithms.

In [None]:
# Importing required Packages
import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
from IPython.display import Image 


## Pandas Data Structure<br>


<li>Series
<li>DataFrame


### What is Series 
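A Series is a one-dimensional labeled array: a sequence of values, each paired with an index label. The sketch below builds one from made-up values; the labels and the name `ages` are purely illustrative.

```python
import pandas as pd

# Hypothetical values and labels, for illustration only
ages = pd.Series([31, 25, 42], index=["a", "b", "c"], name="age")

print(ages)         # values shown next to their index labels
print(ages["b"])    # label-based lookup of a single value
print(ages.mean())  # Series support the usual summary methods
```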

### Create DataFrames
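One common way to create a DataFrame by hand is from a dictionary that maps column names to lists of values. This is a minimal sketch with invented data; the column names are assumptions, not part of any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; each dictionary key becomes a column
toy = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "age": [31, 25, 42],
})

print(toy)                 # three rows, two columns
print(toy.columns)         # column labels
print(type(toy["age"]))    # selecting one column returns a Series
```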

## Importing Data


In [None]:
Image("files_read.png")

### Import .csv files from a local machine

Use the file path: file_path = "/home//Desktop/Project/"

In [None]:
# Read Loan Dataset


### Importing Files from a web url

In [None]:
url = 'https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/HR_comma_sep.csv.txt'

df_hr= pd.read_csv(url)
df_hr

### Get the data from an Excel file

In [None]:
# df_sales = pd.read_excel('data-science-complete-tutorial/Data/sales_info.xlsx')

# df_sales

### Exploring pd.read_csv()



#### <b> sep = "," by default  :</b>  

Specify the separator if it is not a comma.

In [None]:
Image("sep.png")

 

Use `sep="|"` as shown below

In [None]:
Image("sep1.png")
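The effect of `sep` can also be tried without a file on disk. The sketch below feeds a small made-up pipe-delimited string to `pd.read_csv` through `io.StringIO`; the data and variable names are illustrative assumptions.

```python
import io
import pandas as pd

# Made-up pipe-delimited text standing in for a file on disk
raw = "id|city|sales\n1|Pune|250\n2|Delhi|300\n"

# Tell pandas the fields are separated by "|"
pipe_df = pd.read_csv(io.StringIO(raw), sep="|")
print(pipe_df)

# With the default sep=",", each line is read as one single field
comma_df = pd.read_csv(io.StringIO(raw))
print(comma_df.shape)
```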

#### <b> header :</b>  

Use the `header` argument of `read_csv` to specify which line in your data is to be treated as the header.

In [None]:
Image("data_header.png")

In [None]:
Image("header_0.png")

**header = 1 means the second line of the dataset is used as the header.**

In [None]:
Image("header_1.png")
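As a rough illustration of `header=1`, the sketch below reads made-up text whose first line is a title rather than the column names. The data is invented for this example.

```python
import io
import pandas as pd

# Made-up text: the first line is a title, the second holds the column names
raw = "Sales report\nid,city,sales\n1,Pune,250\n2,Delhi,300\n"

# header=1 uses the second line as the header and skips the line above it
hdr_df = pd.read_csv(io.StringIO(raw), header=1)
print(hdr_df)
```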

#### <b> index_col :</b>  

Use this argument to specify the row labels to use. If you set index_col to 0, then the first column of the dataframe will become the row label.

In [None]:
Image("index_col.png")
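A small sketch of `index_col=0`, again on invented data: the first column (`id` here, an assumed name) becomes the row labels instead of a regular column.

```python
import io
import pandas as pd

# Made-up data; "id" is the first column
raw = "id,city,sales\n101,Pune,250\n102,Delhi,300\n"

idx_df = pd.read_csv(io.StringIO(raw), index_col=0)
print(idx_df)
print(idx_df.index.name)         # the index keeps the column's name
print(idx_df.loc[102, "sales"])  # rows can now be looked up by id
```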

#### <b> usecols :</b>  

Use `usecols` when you want to load only specific columns into a dataframe. It is most useful when the input dataset contains a large number of columns and you need only a subset of them.

In [None]:
Image("usecols.png")
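The following sketch shows `usecols` on a made-up four-column input; only the two requested columns are loaded. Column names are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up data with four columns
raw = "id,city,sales,region\n1,Pune,250,West\n2,Delhi,300,North\n"

# Load only the columns that are needed
sub_df = pd.read_csv(io.StringIO(raw), usecols=["id", "sales"])
print(sub_df)
```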

#### <b> nrows :</b>  

If you want to read a limited number of rows, instead of all the rows in a dataset, use nrows. This is especially useful when reading a large file into a pandas dataframe.

In [None]:
Image("nrows.png")
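To see `nrows` in action, the sketch below generates a made-up 1000-row input in memory and reads only the first five data rows.

```python
import io
import pandas as pd

# Made-up input: a header line followed by 1000 data rows
raw = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1000))

# Read only the first 5 data rows (the header line is not counted)
head_df = pd.read_csv(io.StringIO(raw), nrows=5)
print(head_df)
```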

In [None]:
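# Standard practice: keep an untouched copy of the original data before modifying it.
# Here df stands for whichever DataFrame was loaded above (for example df_hr).
data_original = df.copy()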

# High Level Data Understanding

### Functions

***
These functions are the most common tools used when trying to summarize your data

- **df.head(n)** - Returns the first n rows of your DataFrame. With no argument, it returns the first 5 rows
- **df.tail(n)** - Returns the last n rows of your DataFrame. With no argument, it returns the last 5 rows
- **df.shape** - Gives the number of rows and columns in your DataFrame as a tuple (it is an attribute, so no parentheses)
- **df.describe()** - Displays a statistical summary for numerical columns
- **df.describe(include=['object'])** - Displays a statistical summary for all object (string) columns
- **df.describe(include='all')** - Displays a statistical summary for all columns
- **df.mean()** - Returns the mean of each numeric column (pass `numeric_only=True` if the DataFrame also contains non-numeric columns)
- **df.median()** - Returns the median of each numeric column
- **df.std()** - Returns the standard deviation of each numeric column
- **df.max()** - Returns the highest value in each column
- **df.min()** - Returns the lowest value in each column
- **df.dtypes** - Returns the data type of each column
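The functions listed above can be tried on a small DataFrame. The sketch below uses invented values; the column names are assumptions chosen to resemble a loan dataset, not the real data.

```python
import pandas as pd

# Hypothetical toy data mixing numeric and string columns
demo = pd.DataFrame({
    "income": [2500, 4000, 3200, 5100],
    "term": [360.0, 180.0, 360.0, 360.0],
    "education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})

print(demo.head(2))                   # first 2 rows
print(demo.tail(2))                   # last 2 rows
print(demo.shape)                     # attribute, so no parentheses
print(demo.describe())                # numeric columns only
print(demo.describe(include="all"))   # numeric and string columns
print(demo.mean(numeric_only=True))   # skip the string column
print(demo.dtypes)                    # one dtype per column
```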


### See the first 5 entries

<li>data.head()

### See the last 5 entries

<li> data.tail()

### What is the number of observations & features in the dataset? 

<li> data.shape

#### Shape of Dataframe

`data.shape` gives both values as a tuple: (observations/rows, columns).

#### No. of observations(Rows)

`data.shape[0]` gives only the number of observations/rows.

#### No. of Features(Columns)

`data.shape[1]` gives only the number of features/columns.
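The three cases above can be sketched on a small made-up DataFrame (the name `demo` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical data: 4 observations, 2 features
demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

print(demo.shape)     # (rows, columns)
print(demo.shape[0])  # number of observations/rows
print(demo.shape[1])  # number of features/columns
```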

###  Print the name of all the columns.

We have 12 independent variables and 1 target variable, i.e. Loan_Status in the loan_data dataset

In [None]:
Image('Datacolumns.PNG')

###  What is the name of 3rd column?

### How is the dataset indexed?
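The column and index questions above can be answered with `columns` and `index`. This is a sketch on a made-up three-column frame; the column names are borrowed from the loan dataset, but the values and the column order are assumptions.

```python
import pandas as pd

# Hypothetical rows; only the column names echo the loan dataset
demo = pd.DataFrame({
    "Loan_ID": ["X001", "X002"],
    "Gender": ["Male", "Female"],
    "Married": ["No", "Yes"],
})

print(demo.columns.tolist())  # names of all columns
print(demo.columns[2])        # third column (positions start at 0)
print(demo.index)             # default index: a RangeIndex starting at 0
```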

### Datatype of Features

<li><b>object: </b> The object dtype typically holds strings, so these variables are treated as categorical. Categorical variables in our dataset are: Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status<br><br>
<li> <b>int64: </b> It represents integer variables. ApplicantIncome is of this type.<br><br>
<li> <b>float64: </b> It represents variables that have decimal values. They are also numerical variables. Numerical variables in our dataset are: CoapplicantIncome, LoanAmount, Loan_Amount_Term, and Credit_History<br>
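A quick way to inspect these types is `dtypes`, and `select_dtypes` can pull out the numeric columns. The sketch below uses invented values; only the column names come from the loan dataset.

```python
import pandas as pd

# Hypothetical values; a missing LoanAmount forces that column to float64
demo = pd.DataFrame({
    "Loan_ID": ["X001", "X002"],
    "ApplicantIncome": [5000, 4200],
    "LoanAmount": [120.0, None],
})

print(demo.dtypes)

# Names of the numeric (int64 and float64) columns
num_cols = demo.select_dtypes(include="number").columns.tolist()
print(num_cols)
```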

###  Features information

### Describing Data

# Low Level Data Understanding

## Univariate Analysis

### Categorical Feature

In [None]:
###Education col



#### Unique value in a column

#### No. of unique value in a column

#### Bar chart

The loans of 422 people (around 69%) out of 614 were approved.
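Counts and shares like these usually come from `unique`, `nunique`, and `value_counts`. The sketch below rebuilds only the 422-of-614 split quoted above as a stand-in Series; the labels "Y" and "N" are assumptions, and this is not the real loan data.

```python
import pandas as pd

# Stand-in for the Loan_Status column: 422 approvals out of 614 rows
status = pd.Series(["Y"] * 422 + ["N"] * 192, name="Loan_Status")

print(status.unique())    # distinct values in the column
print(status.nunique())   # number of distinct values

counts = status.value_counts()                # absolute counts
share = status.value_counts(normalize=True)   # proportions
print(counts)
print(share.round(2))
```

With matplotlib installed, `counts.plot.bar()` draws the corresponding bar chart.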

The different types of variables are categorical, ordinal, and numerical.

**Categorical features:** These features have categories (Gender, Married, Self_Employed, Credit_History, Loan_Status)

**Ordinal features:** Categorical features whose categories have a natural order (Dependents, Education, Property_Area)

**Numerical features:** These features have numerical values (ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term)

#### Pie chart

### Numeric Feature

#### Histogram

#### Density plot

#### Box Plot


It can be inferred that most of the applicant income values are concentrated on the left of the distribution, which means it is right-skewed rather than normally distributed. We will try to make it closer to normal in later sections, as many algorithms work better when the data is approximately normally distributed.
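Skewness can also be checked numerically with `Series.skew()`. The sketch below uses synthetic right-skewed values rather than the loan data, and the log transform is just one common option assumed here for illustration, not necessarily the method used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "income" values; not the loan dataset
income = pd.Series(rng.lognormal(mean=8.0, sigma=0.8, size=1000), name="income")

print(income.skew())          # clearly positive: long right tail
print(np.log(income).skew())  # much closer to 0 after a log transform
```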

The boxplot confirms the presence of many outliers/extreme values. This can be attributed to income disparity in society. Part of it may be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:
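Segregating by Education can be sketched with `groupby`. The values below are invented to mimic a few typical incomes plus one extreme value; only the column names come from the loan dataset.

```python
import pandas as pd

# Hypothetical incomes, including one extreme value among graduates
demo = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "ApplicantIncome": [3000, 4500, 5200, 81000, 2100, 2600, 3300],
})

by_edu = demo.groupby("Education")["ApplicantIncome"]
print(by_edu.describe())   # per-group count, mean, quartiles, max

medians = by_edu.median()  # the median is robust to the extreme value
print(medians)
```

With matplotlib installed, `demo.boxplot(column="ApplicantIncome", by="Education")` draws one box per education level.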

In [3]:

import pandas as pd

url = 'https://raw.githubusercontent.com/Shreyas3108/house-price-prediction/master/kc_house_data.csv'
df = pd.read_csv(url)
df

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,...,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,...,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,...,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,...,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287
