# Pandas

Real datasets are rarely purely numerical; more often they are semi-structured. Pandas is a Python library for data manipulation and analysis.

License: open source, BSD (can be used commercially)

References: there are many, but this one is very systematic.
- https://www.tutorialspoint.com/python_pandas/index.htm

## Data structures in Pandas

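Pandas has historically offered three kinds of data structures, namely
- Series (1D)
- DataFrame (2D)
- Panel (3D; deprecated and removed from recent releases, so only the first two are covered here)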

## Pandas.Series

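Let us start by creating an empty series. It holds no data; the default datatype (dtype) of an empty series is float64 in older Pandas versions and object in newer ones, so we pass the dtype explicitly.

In [None]:
import pandas as pd

# Pass dtype explicitly: the default for an empty Series differs across Pandas versions
s = pd.Series(dtype="float64")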
print(s)

### A Series from numpy ndarray

In [None]:
import numpy as np
import pandas as pd

s1 = pd.Series(np.array(["Virat","Rohit","Jasprit","Ajinkya","Rahul","Rishabh","MS"]))
print(s1)

Notes:
1. The indices 0 to $n-1$ (where $n$ is the length of the array) are generated automatically.
2. Pandas tries to infer the data type. Strings have no numeric type, so here the dtype is reported as object (the general-purpose type) in most Pandas versions.

In [None]:
s2 = pd.Series(np.array([5,2,10,4,3,8,2]))
print(s2)

The integer datatype is inferred here. 
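
The inferred type can be checked directly through the `dtype` attribute. The following is a small sketch; the variable names and the `mixed` example are illustrative and not part of the cells above.

```python
import numpy as np
import pandas as pd

names = pd.Series(np.array(["Virat", "Rohit", "Jasprit"]))
runs = pd.Series(np.array([5, 2, 10]))
mixed = pd.Series([5, 2.5, 10])  # one float is enough to make the whole series float

print(names.dtype)  # a non-numeric dtype (object in most Pandas versions)
print(runs.dtype)   # an integer dtype
print(mixed.dtype)  # float64
```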

### Using index of our choice

Instead of letting Pandas create the default index, we can pass a list of index values. The index values must be hashable, and the list must have the same length as the data (a ValueError is raised if the data is an ndarray and the lengths do not match). They need not be unique.

In [None]:
s1 = pd.Series(np.array(["Virat","Rohit","Jasprit","MS"]), index=[18,45, 93, 7])
print(s1)

# Experimenting with non-unique keys
# In this case both are retained
#print(s1[7])
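
The experiment hinted at in the comments can be sketched as follows. The duplicate label 7 and the name `s_dup` are made up for illustration; the sketch also shows the error raised when the index length does not match the data.

```python
import numpy as np
import pandas as pd

# Two entries share the label 7 (an illustrative, made-up duplicate)
s_dup = pd.Series(np.array(["Virat", "Rohit", "MS", "Rahul"]), index=[18, 45, 7, 7])
print(s_dup.index.is_unique)
print(s_dup.loc[7])  # both entries labelled 7 are returned

# A length mismatch between an ndarray and the index is rejected
try:
    pd.Series(np.array(["Virat", "Rohit"]), index=[18, 45, 93])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)
```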

The indices need not be integers only. 

In [None]:
s1 = pd.Series(np.array(["Virat","Rohit","Jasprit","MS"]), index=[18,45, "tbd",7])
print(s1)

### A Series from a dictionary

If a dictionary is passed as the data, then the keys of the dictionary are taken as the indices (they are unique), and the values as the data. 

In [None]:
s1 = pd.Series({18: "Virat", 45: "Rohit", 93: "Jasprit", 7: "MS"})
print(s1)

However, if a list of indices is also passed, the series contains only the specified indices; the remaining dictionary elements are left out.

In [None]:
s1 = pd.Series({18: "Virat", 45: "Rohit", 93: "Jasprit", 7: "MS"}, index=[45, 93, 7])
print(s1)

If an index is not present as a key in the dictionary, its value is set to NaN (missing).

In [None]:
s1 = pd.Series({18: "Virat", 45: "Rohit", 93: "Jasprit", 7: "MS"}, index=[45, 93, 7, 100])
print(s1)
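
Such missing entries can be detected and filtered out. A minimal sketch, using `isna` and `dropna`; the variable names are illustrative.

```python
import pandas as pd

players = {18: "Virat", 45: "Rohit", 93: "Jasprit", 7: "MS"}
s_missing = pd.Series(players, index=[45, 93, 7, 100])

is_missing = s_missing.isna()   # True where no dictionary key matched the index
present = s_missing.dropna()    # a new series without the missing entries

print(is_missing)
print(present)
```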

### A Series from a scalar

We can create a series by repeating a scalar value for every index specified. 

In [None]:
s1 = pd.Series(3, index=[45, 93, 7, 100])
print(s1)

### Accessing elements in a series

Elements of a series can be accessed by index, similar to arrays (with some differences).

In [None]:
# If the indices of a series are integers, then elements are accessed by the integer indices.
s1 = pd.Series({18: "Virat", 45: "Rohit", 93: "Jasprit", 7: "MS"}, index=[18, 45, 93, 7])
#print(s1[93])
print(s1[7])

In [None]:
# If the indices of a series are strings, then elements can be accessed by the string indices.
s1 = pd.Series({"c": "Virat", "vc": "Rohit", "b": "Jasprit", "exc": "MS"})
print(s1["c"])
print(s1["vc"])
print()
# Or by giving a list of multiple indices
print(s1[["vc","c"]])

In [None]:
print(s1)

In [None]:
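# They can also be accessed by position (including negative positions and slices) through iloc.
# Plain s1[0] or s1[-1] on a string index relies on a positional fallback that recent Pandas versions deprecate.
#print(s1.iloc[0])
#print(s1.iloc[2])
print(s1.iloc[-1])
#print(s1.iloc[:2])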
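
Label-based and position-based access can be kept apart explicitly with the `loc` and `iloc` indexers of a series. The sketch below assumes the same string-indexed series as above, under a different variable name.

```python
import pandas as pd

s_roles = pd.Series({"c": "Virat", "vc": "Rohit", "b": "Jasprit", "exc": "MS"})

first = s_roles.iloc[0]             # by position
last = s_roles.iloc[-1]             # negative positions count from the end
first_two = s_roles.iloc[:2]        # positional slice, end excluded
by_label = s_roles.loc["vc"]        # by label
label_slice = s_roles.loc["c":"b"]  # label slice, end label included

print(first, last, by_label)
print(first_two)
print(label_slice)
```

Note that a positional slice excludes its end, whereas a label slice includes the end label.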

### Changing data

The data at any index can be set to a different value. 

In [None]:
# Set data at index "b" to "Bhuvi"
s1["b"] = "Bhuvi"

print(s1)

In [None]:
s1 = pd.Series(np.array(["Virat","Rohit","Jasprit","MS"]), index=[18,45, 93, 7])
#print(s1)

s1[20] = "Shreyas"
print(s1)

### Note: size-mutability

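Many tutorials state that a Pandas Series is size-immutable, that is, its length cannot be changed once it is created. However, adding new elements works: assigning to a label that does not exist yet enlarges the series (the Pandas documentation calls this setting with enlargement). It does not require Series.append, which has been removed from recent Pandas versions.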

In [None]:
s1["s"] = "Jaddu"

print(s1)
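
A minimal sketch of how a series grows in practice: assignment to a new label adds one element, and `pd.concat` joins two series. The variable names are illustrative.

```python
import pandas as pd

s_grow = pd.Series(["Virat", "Rohit"], index=[18, 45])
n_before = len(s_grow)

s_grow.loc[93] = "Jasprit"  # the label 93 does not exist yet, so the series is enlarged
n_after = len(s_grow)

extra = pd.Series(["MS"], index=[7])
combined = pd.concat([s_grow, extra])  # returns a new series; s_grow is unchanged

print(n_before, n_after)
print(combined)
```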

## DataFrame

A dataframe is a 2D structure, similar to a table. Dataframes are very widely used for real-world data.

A dataframe can be created empty, from a list (or 1-D array), or from a list of lists (or 2-D array).

In [None]:
# Empty dataframe
df = pd.DataFrame()
print(df)

In [None]:
# From a 1-D array (a plain list works the same way)
df = pd.DataFrame(np.array([7,8,1,2,9,3]))
print(df)

The row and column indices are generated automatically this way.

In [None]:
# From a 2-D array (a list of lists works the same way)
df = pd.DataFrame(np.array([[7,8,1],[2,9,3]]))
print(df)

Essentially, this is like a wrapper around a numpy 2-D array, but with row and column indices explicitly attached to it.
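
The relation to the underlying array can be inspected with `shape` and `to_numpy`. A short sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

arr = np.array([[7, 8, 1], [2, 9, 3]])
df_arr = pd.DataFrame(arr)

shape = df_arr.shape            # (rows, columns)
col_labels = list(df_arr.columns)
back = df_arr.to_numpy()        # the values as a numpy 2-D array again

print(shape)
print(col_labels)
print(back)
```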

### Names for columns

The columns are intuitively treated as attributes (of tables), so they can have names. 

In [None]:
data = [["Virat",18],["Rohit",45],["Jasprit",93]]
df = pd.DataFrame(data,columns=["Name","Jersey No."])
print(df)

### DataFrame from dictionaries

A dataframe can be created from a dictionary of ndarrays or lists of the same length, each one used as a column, or from a list of dictionaries, each one used as a row.

#### From dictionary of lists

In [None]:
# From dictionary of lists

import pandas as pd

data = {"Name": ["Virat", "Rohit", "Jasprit", "MS"], "Jersey No.":[18, 45, 93, 7]}
df = pd.DataFrame(data)

print(df)

#### From list of dictionaries 

In [None]:
data = [{"Name": "Virat", "Jersey No.": 18}, {"Name": "Rohit", "Jersey No.": 45}, {"Name": "Rahul", "Type": "Wicketkeeper"}]
df = pd.DataFrame(data)

print(df)
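
When the dictionaries do not share the same keys, the columns are the union of all keys and the gaps are filled with NaN. The sketch below repeats the data above under a different variable name to make this visible.

```python
import pandas as pd

rows = [
    {"Name": "Virat", "Jersey No.": 18},
    {"Name": "Rohit", "Jersey No.": 45},
    {"Name": "Rahul", "Type": "Wicketkeeper"},
]
df_rows = pd.DataFrame(rows)

print(df_rows.isna())   # True marks the cells that had no value in the input
print(df_rows.dtypes)   # "Jersey No." becomes float64 because it now holds a NaN
```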

### DataFrame with row and column indices

A DataFrame may have custom row indices and column names. 

In [None]:
data = {"Name": ["Virat", "Rohit", "Jasprit", "MS"], "Jersey No.":[18, 45, 93, 7]}
df = pd.DataFrame(data, index=[3,1,11,5])
#df = pd.DataFrame(data)
print(df)

## DataFrame operations

### Selecting a row

A row can be selected by passing its row index (label) to the loc indexer. Passing a list of labels selects several rows.

In [None]:
print(df.loc[[1,3]])

Or, by passing the row position to the iloc indexer.

In [None]:
print(df.iloc[[1,3]])
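
The two calls look alike but select different rows, because the row labels here are 3, 1, 11, 5 while the positions are 0 to 3. A self-contained sketch, rebuilding the dataframe under the illustrative name `df_sel`:

```python
import pandas as pd

data = {"Name": ["Virat", "Rohit", "Jasprit", "MS"], "Jersey No.": [18, 45, 93, 7]}
df_sel = pd.DataFrame(data, index=[3, 1, 11, 5])

by_label = df_sel.loc[[1, 3]]      # rows labelled 1 and 3
by_position = df_sel.iloc[[1, 3]]  # second and fourth rows
one_row = df_sel.loc[1]            # a single label returns the row as a Series

print(by_label)
print(by_position)
print(one_row)
```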

### Slicing rows

A slice of rows can be selected using a range of positions (not index labels).

In [None]:
print(df[0:2])
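
The same positional slice can be written with `iloc`, while `loc` slices by label and includes the end label. A small sketch, again with an illustrative copy of the dataframe:

```python
import pandas as pd

data = {"Name": ["Virat", "Rohit", "Jasprit", "MS"], "Jersey No.": [18, 45, 93, 7]}
df_slice = pd.DataFrame(data, index=[3, 1, 11, 5])

pos_slice = df_slice.iloc[0:2]    # positions 0 and 1, end excluded
label_slice = df_slice.loc[3:11]  # from label 3 up to and including label 11

print(pos_slice)
print(label_slice)
```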

### Selecting columns

Selection of columns works by column name.

In [None]:
print(df["Name"])

### Adding a column

We can add a new column to an existing dataframe. 

In [None]:
df["Average"] = pd.Series([59, 49, 5, 51], index=[3,1,11,5])
#df["Average2"] = pd.Series([53, 40, 5, 35])
print(df)
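
The commented-out line matters because a Series is aligned to the dataframe by index label, not by position. The sketch below shows what happens when the labels only partly match; `df_al` and the column `Average3` are illustrative names.

```python
import pandas as pd

data = {"Name": ["Virat", "Rohit", "Jasprit", "MS"], "Jersey No.": [18, 45, 93, 7]}
df_al = pd.DataFrame(data, index=[3, 1, 11, 5])

# This series has the default index 0..3, so only the labels 3 and 1 match
df_al["Average2"] = pd.Series([53, 40, 5, 35])

# A plain list has no index and is assigned by position
df_al["Average3"] = [53, 40, 5, 35]

print(df_al)
```

Rows 11 and 5 receive NaN in `Average2` because the series has no entries with those labels.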

### Deleting a column using the pop function

In [None]:
df.pop("Average")
print(df)

# Or, use del df["Average"]
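
The `pop` function changes the dataframe in place and returns the removed column, whereas `drop` with the `columns` argument returns a new dataframe. A brief sketch with illustrative names:

```python
import pandas as pd

df_cols = pd.DataFrame(
    {"Name": ["Virat", "Rohit"], "Jersey No.": [18, 45], "Average": [59, 49]}
)

removed = df_cols.pop("Average")                   # df_cols loses the column
trimmed = df_cols.drop(columns=["Jersey No."])     # df_cols itself is unchanged

print(removed)
print(list(df_cols.columns))
print(list(trimmed.columns))
```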

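### Adding rows by concatenation

A dataframe with the same columns can be concatenated to another, which is equivalent to adding more rows. Older Pandas versions offered df.append for this; it has been removed in Pandas 2.0, so pd.concat is used here.

In [None]:
df1 = pd.DataFrame({"Name": ["Rahul", "Rishabh"], "Jersey No.":[1, 55]}, index=[2,6])
df = pd.concat([df, df1])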

print(df)
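
When the dataframes being combined share row labels, the combined result keeps the duplicates unless the index is regenerated. A sketch with `pd.concat` and its `ignore_index` option; the two small dataframes are illustrative.

```python
import pandas as pd

top = pd.DataFrame({"Name": ["Virat", "Rohit"], "Jersey No.": [18, 45]})
more = pd.DataFrame({"Name": ["Rahul", "Rishabh"], "Jersey No.": [1, 55]})

kept = pd.concat([top, more])                           # row labels 0, 1, 0, 1
renumbered = pd.concat([top, more], ignore_index=True)  # row labels 0, 1, 2, 3

print(kept)
print(renumbered)
```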

### Deleting a row by drop

In [None]:
df = df.drop(6)

print(df)
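
The `drop` function returns a new dataframe, which is why the result is assigned back above. It also accepts a list of labels, and a missing label raises a KeyError unless `errors="ignore"` is passed. A short sketch with illustrative names:

```python
import pandas as pd

df_drop = pd.DataFrame(
    {"Name": ["Virat", "Rohit", "Jasprit"], "Jersey No.": [18, 45, 93]},
    index=[3, 1, 11],
)

smaller = df_drop.drop([1, 11])                  # drop several rows by label
unchanged = df_drop.drop(99, errors="ignore")    # label 99 does not exist; nothing is dropped

print(smaller)
print(len(df_drop), len(unchanged))
```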