**Pandas**
Pandas is a Python package for handling labelled data, used predominantly for data analysis and manipulation.

Pandas can be installed from PyPI (using pip) or through the Anaconda distribution.

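Pandas has the following core data structures:
1. DataFrame (two-dimensional)
2. Series (one-dimensional)

The ndarray is NumPy's array type rather than a pandas structure. Pandas builds on it, and the values of a Series or DataFrame can be converted to an ndarray with to_numpy().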
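
As a rough illustration of how these pieces relate, the sketch below builds a small Series and a small DataFrame. The names `s` and `frame` and the sample values are made up for this example.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labelled array
s = pd.Series([0.11, 0.23, 0.56], index=["a", "b", "c"], name="nums")

# A DataFrame is a two-dimensional table; each of its columns is a Series
frame = pd.DataFrame({"nums": [0.11, 0.23, 0.56], "flags": [True, False, True]})

print(type(s))                          # a pandas Series
print(type(frame["nums"]))              # selecting one column also gives a Series
print(type(frame["nums"].to_numpy()))   # the column values as a NumPy ndarray
```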

**Pandas DataFrame**

A pandas DataFrame is a 2-dimensional labeled data structure whose columns can have different types. It can be thought of as a matrix in which each column may hold a different type of data.

Characteristics:
1. Multiple rows and columns of data
2. Each row represents a sample of data
3. Each column represents a variable/dimension of the dataset
4. Usually, all the data in a column is of the same type.


In [None]:
#Import the Pandas package
import pandas as pd

**Creating DataFrames**


In [None]:

#1. Manually entering the data
df2 = pd.DataFrame(
    {
        "SLNO" : [1,2,3,4,5],
        "Months" : ['Jan', 'Feb', 'Mar', 'Apr', 'May'],
        "Col3" : [0.11, 0.23, 0.56, 0.12, 0.87],
        "Bool_Col" : [True, True, False, True, False]
    }
)

df2

In [None]:
#2. From CSV files (.csv)
df = pd.read_csv("../input/train.csv")

#A similar function, pd.read_excel, is available for Excel spreadsheets as well

print(type(df))
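
If the CSV file is not at hand, `read_csv` can also be tried on an in-memory buffer. This is a minimal sketch in which the made-up text in `csv_text` stands in for a real file:

```python
import io
import pandas as pd

# Hypothetical CSV content; a real file path could be passed in the same way
csv_text = "Name,Age,Fare\nAnn,29,7.25\nBob,,71.28\nCid,35,8.05\n"

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (3, 3)
print(sample["Age"].dtype)   # float64, because the empty cell is read as NaN
```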

In [None]:
#Dimension
df.ndim #2-dimensional

In [None]:
#Shape
df.shape #891 rows x 12 columns

In [None]:
#Total number of cells
df.size # 891*12 = 10692
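
The relationship between these three attributes is easy to check on a tiny frame. The `small` DataFrame below is made up for illustration:

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

rows, cols = small.shape

print(small.ndim)    # 2
print(small.shape)   # (3, 2)
print(small.size)    # 6, which is rows * cols
print(len(small))    # 3, the number of rows
```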

**Renaming the columns**

In [None]:
#Columns
df.columns

In [None]:
#Change column name
df2.columns

In [None]:
#1. Assign a complete list of new names to df2.columns
df2.columns = ['index', 'mon', 'nums', 'bools']

df2.columns

In [None]:
#2. Rename selected columns with rename(), which returns a new dataframe
df2 = df2.rename(columns={'mon':'Month'})

df2.columns

In [None]:
#inplace parameter will change the dataframe without assignment
df2.rename(columns={"Month" : "months",
                   "nums" : "Numbers"}, 
           inplace=True)

df2.columns

**Previewing**

In [None]:
#First few rows
df.head(6)

#Indexing starts with 0

In [None]:
#Last few rows
df.tail(6)

In [None]:
#Datatype of data in each column
df2.dtypes

In [None]:
#Summarizing
df.describe()

#The describe function gives a column-wise statistical summary of the numerical columns, including quartile values (25%, 50%, 75%)

In [None]:
df.info()

#'Age' has only 714 values, rest are NAs: null values/missing values
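
`info()` reports non-null counts, so the number of missing values has to be worked out from them. One way to get it directly is `isna().sum()`. This is a small sketch on made-up data (the `people` frame is hypothetical):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(
    {"Age": [22.0, np.nan, 35.0, np.nan], "Fare": [7.25, 71.28, 8.05, 53.1]}
)

# isna() marks missing cells as True; summing counts them per column
missing = people.isna().sum()

print(missing["Age"])          # 2
print(missing["Fare"])         # 0
print(people["Age"].count())   # 2 non-null values
```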

**Selecting and Manipulating Data**

**Selecting columns**

There are three main methods of selecting columns in pandas:
1. using a dot notation, e.g. data.column_name
2. using square brackets and the name of the column as a string, e.g. data['column_name']
3. using numeric indexing and the iloc selector data.iloc[:, <column_number>]
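
The three methods usually give the same result, but dot notation has limits: it cannot reach a column whose name contains a space or clashes with an existing DataFrame attribute. The sketch below uses a made-up `data` frame to show this:

```python
import pandas as pd

data = pd.DataFrame(
    {"Fare": [7.25, 71.28], "count": [1, 2], "ticket class": [3, 1]}
)

print(data.Fare.equals(data["Fare"]))         # True: both select the same column
print(data.iloc[:, 0].equals(data["Fare"]))   # True: 'Fare' is at position 0

# 'count' clashes with the DataFrame.count method, so dot notation returns the method
print(callable(data.count))                   # True
print(data["count"].tolist())                 # [1, 2]

# A name containing a space can only be reached with square brackets
print(data["ticket class"].tolist())          # [3, 1]
```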

In [None]:
#1. using dot notation
df.Fare.head(4)

In [None]:
#2. using square brackets
df['Fare'].head(4)

In [None]:
#3. using iloc
df.iloc[:,9].head(4) #Column at position 9 (counting from 0) is 'Fare'

In [None]:
#iloc[:,9] is same as iloc[0:891, 9]: the end of the slice is excluded, so 0:891 covers all 891 rows
df.iloc[0:891,9].head(4)

**Selecting multiple columns**

In [None]:
df[['Fare', 'Age', 'Sex']].head(4)

In [None]:
#Multiple columns selection using iloc
df.iloc[:4, [9, 5, 4]]


In [None]:
#Can select rows as well using iloc
df.iloc[2:8, 9] #From row 2 to row 7; 8 excluded

In [None]:
#Row selection can also be used with square brackets
df['Fare'][2:8]

In [None]:
#Or with dot notation
df.Fare[2:8]

When a column is selected using any of these methods, a 'pandas.Series' is the resulting datatype. A pandas series is a one-dimensional set of data. 
Basic operations that can be carried out on these Series of data are:
1. Summarizing: sum(), mean(), count(), median(), min(), max(), unique(), nunique() 
2. replacing missing values (.fillna(new_value)).
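
These operations skip missing values by default, which a small example makes visible. The `fares` Series below is made up for illustration:

```python
import numpy as np
import pandas as pd

fares = pd.Series([10.0, 20.0, np.nan, 30.0, 30.0])

print(fares.sum())       # 90.0, NaN is skipped
print(fares.mean())      # 22.5, the mean of the four non-missing values
print(fares.count())     # 4, NaN is not counted
print(fares.median())    # 25.0
print(fares.min(), fares.max())   # 10.0 30.0
print(fares.nunique())   # 3, NaN is ignored by default
print(len(fares.unique()))        # 4, unique() keeps NaN as a value

# Replacing NaN with 0 changes the mean, because the value now counts
print(fares.fillna(0).mean())     # 18.0
```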

In [None]:
#1. summing
df.Fare.sum()

In [None]:
#2. averaging
df.Fare.mean()

In [None]:
#3. counting
df.Fare.count()

In [None]:
#4. median
df.Fare.median()

In [None]:
#5. minimum value
df.Fare.min()

In [None]:
#6. Maximum value
df.Fare.max()

In [None]:
#7. unique values
df.Sex.unique()

In [None]:
#8. Number of unique values
df.Sex.nunique()

In [None]:
#9. Missing values imputation using fillna function
df.Age.count()
#Total 891 rows, only 714 present. Rest are NAs (missing values)

In [None]:
df.Age = df.Age.fillna(20)

df.Age.count() #All missing values are filled with '20', new count is 891.
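
Filling with a fixed value such as 20 is one choice. A common alternative is to fill with the column's own median. This is a minimal sketch on made-up data (the `ages` Series is hypothetical):

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 30.0, 38.0, np.nan])

# The median is computed from the non-missing values only
median_age = ages.median()
filled = ages.fillna(median_age)

print(median_age)       # 30.0
print(ages.count())     # 3
print(filled.count())   # 5
```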

**Row selection**

The basic methods to get your head around are:
1. numeric row selection using the iloc selector, e.g. data.iloc[2:10, :] selects the rows at positions 2 to 9 (the end position 10 is excluded).
2. label-based row selection using the loc selector, e.g. data.loc[[2,3,4,10,11,12], :]. With the default integer index the labels coincide with the positions; the two differ once another index has been set on the dataframe.
3. logical-based row selection using evaluated statements, e.g. data[data["Area"] == "Ireland"] – select the rows where Area value is ‘Ireland’.
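
The difference between the three selectors is easiest to see on a frame whose index is not the default one. The sketch below assumes a made-up `data` frame with string labels:

```python
import pandas as pd

data = pd.DataFrame(
    {"Area": ["Ireland", "France", "Ireland", "Spain"], "Value": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# iloc works on positions and excludes the end of the slice
by_position = data.iloc[1:3, :]        # rows 'x' and 'y'

# loc works on labels and includes the end label
by_label = data.loc["x":"z", :]        # rows 'x', 'y' and 'z'

# A boolean condition keeps the rows where it is True
irish = data[data["Area"] == "Ireland"]   # rows 'w' and 'y'

print(by_position)
print(by_label)
print(irish)
```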

In [None]:
#1. 
df.iloc[2:6, :] #Same as df.iloc[2:6,]

In [None]:
#2. Label-based selection with loc (with the default index, labels equal positions)
df.loc[[2,3,4,10,20], :]

In [None]:
#3. Logical selection
df[ df['Sex'] == 'male' ].head(4)
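
Several conditions can be combined in one logical selection. This is a minimal sketch on a made-up `passengers` frame:

```python
import pandas as pd

passengers = pd.DataFrame(
    {"Sex": ["male", "female", "male", "female"], "Fare": [7.25, 71.28, 53.10, 8.05]}
)

# & is "and", | is "or", ~ is "not"; each condition needs its own parentheses
cheap_male = passengers[(passengers["Sex"] == "male") & (passengers["Fare"] < 10)]
female_or_pricey = passengers[(passengers["Sex"] == "female") | (passengers["Fare"] > 50)]
not_male = passengers[~(passengers["Sex"] == "male")]

print(cheap_male)
print(female_or_pricey)
print(not_male)
```

Python's own `and`, `or` and `not` do not work element-wise on a Series, which is why the symbol operators are used here.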

**Deleting Columns**

In [None]:
# Deleting columns
    
# Delete a column from the dataframe
df = df.drop("Pclass", axis=1)
    
# alternatively, delete columns using the columns parameter of drop
df = df.drop(columns="SibSp")

In [None]:
df.columns

In [None]:
# Delete the Parch column from the dataframe in place
# Note that the original 'df' object is changed when inplace=True
df.drop("Parch", axis=1, inplace=True)

# Delete multiple columns from the dataframe
df = df.drop(["Ticket", "Cabin", "Name"], axis=1)
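
Without `inplace=True`, `drop` returns a new dataframe and leaves the original untouched. The sketch below shows this on a made-up `table` frame, along with the `errors="ignore"` option for labels that may not exist:

```python
import pandas as pd

table = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame; 'table' itself is not modified
trimmed = table.drop(columns=["b"])
print(list(table.columns))     # ['a', 'b', 'c']
print(list(trimmed.columns))   # ['a', 'c']

# Dropping a label that does not exist raises KeyError unless errors="ignore" is passed
same = table.drop(columns=["not_there"], errors="ignore")
print(list(same.columns))      # ['a', 'b', 'c']
```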

In [None]:
df.columns

In [None]:
df.head()

**Deleting rows**

In [None]:
# Delete the rows with labels 0, 1 and 2
df = df.drop([0,1,2], axis=0)
df.head(5)

In [None]:
# Delete the rows with label "male"
# For label-based deletion, set the index first on the dataframe:
df = df.set_index("Sex")
df = df.drop("male", axis=0) # Delete all rows with label "male"

df.head(8)
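
Rows can also be removed by value without changing the index, by keeping only the rows that fail the condition. This is a sketch of that alternative on a made-up `people` frame:

```python
import pandas as pd

people = pd.DataFrame({"Sex": ["male", "female", "male"], "Age": [22, 38, 26]})

# Keep every row whose 'Sex' is not "male"; 'Sex' stays an ordinary column
kept = people[people["Sex"] != "male"]

# reset_index(drop=True) renumbers the remaining rows from 0
kept = kept.reset_index(drop=True)

print(kept)
```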

In [None]:
# Delete the first five rows using iloc selector
df = df.iloc[5:,]

df.head(8)

**Exporting and Saving Pandas DataFrames**

In [None]:
df.head()

In [None]:
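# index=False leaves the index out of the file. 'Sex' was set as the index above,
# so it will not be written; use index=True to keep it
df.to_csv("new_train.csv", index=False)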

A similar function, to_excel, is available for Excel spreadsheets as well.
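
The effect of the `index` parameter can be checked without writing a file, because `to_csv` returns the CSV text when no path is given. This is a minimal sketch on a made-up `out` frame:

```python
import io
import pandas as pd

out = pd.DataFrame({"Sex": ["female", "male"], "Age": [38, 22]})

with_index = out.to_csv()                 # no path given: the CSV text is returned
without_index = out.to_csv(index=False)

print(with_index.splitlines()[0])      # ,Sex,Age  (the index column has an empty header)
print(without_index.splitlines()[0])   # Sex,Age

# Reading the text back recovers the same columns and values
restored = pd.read_csv(io.StringIO(without_index))
print(restored["Age"].tolist())        # [38, 22]
```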