# Pandas Library 
Pandas is a popular open-source Python package used for data manipulation, analysis, and cleaning. It provides easy-to-use tools to conduct various operations on data sets in the form of tabular data (i.e., rows and columns). It can handle different types of data, including numerical, categorical, and textual data.

Some of Pandas' powerful features include:

- Reading and writing data from different file formats, such as CSV, Excel, SQL databases, JSON, HTML, etc.
- Handling missing values and filling them with appropriate data.
- Selecting, filtering, and manipulating data based on different criteria.
- Merging and joining multiple datasets together.
- Grouping and summarizing data based on different variables.
- Creating visualizations of data using built-in methods.

In [1]:
# Importing libraries
import pandas as pd
import numpy as np

### Pandas Series Object
In Pandas, a Series is a one-dimensional labeled array that can hold data of any type (integer, float, string, Python objects, etc.). It provides the foundation for Pandas' more complex data structures such as DataFrame.

A Series consists of two parts:

- `Index`: It is a sequence of labels that helps to identify each element of the Series. The index can be customized or set to default (i.e., 0 to n-1, where n is the length of the Series).
- `Data`: It is a collection of values that can be of any data type. The data is aligned with the index labels so that the elements are easily accessible.

In [2]:
# To create a Series, you can use the pd.Series() constructor, which takes various parameters like data, index, dtype, etc. 
# A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows
data = [5, 10, 15, 20, 25]
series = pd.Series(data)
print(series)

0     5
1    10
2    15
3    20
4    25
dtype: int64


The `Series` wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes. The values are simply a familiar NumPy array.

- The essential difference from a NumPy array and `Series` is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas `Series` has an explicitly defined index associated with the values. This explicit index definition gives the `Series` object additional capabilities.
- In this way, you can think of a Pandas `Series` a bit like a specialization of a Python `dictionary`.

In [3]:
# For example, the index need not be an integer, but can consist of values of any desired
# type. For example, if we wish, we can use strings as an index:
data = pd.Series(data = [5, 10, 15, 20, 25], 
                 index = ['a', 'b', 'c', 'd', 'e'])
print(data)

# And the item access works as expected
print(data[['b', 'a']])

# We can make the Series-as-dictionary analogy even more clear by constructing a
# Series object directly from a Python dictionary:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population)

# Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
population['California':'New York']

a     5
b    10
c    15
d    20
e    25
dtype: int64
b    10
a     5
dtype: int64
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64


California    38332521
Texas         26448193
New York      19651127
dtype: int64

There are different ways to create a `Series` object. 
- Data can be a list or NumPy array, in which case index defaults to an integer sequence.
- Data can be a scalar, which is repeated to fill the specified index.
- Data can be a dictionary, in which index defaults to the sorted dictionary keys.
- In each case, the index can be explicitly set if a different result is preferred

In [12]:
# Creating Series object from a list
pd.Series(data = [2, 4, 6])

# Creating Series object from a scalar
pd.Series(data = 5, index=[100, 200, 300])

# Creating Series object from a dictionary
pd.Series({2:'a', 1:'b', 3:'c'})

# The index can be explicitly set if a different result is preferred:
pd.Series({2:'a', 1:'b', 3:'c'}, index = [3, 2])

3    c
2    a
dtype: object

### DataFrame Object
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different data types. It is similar to a spreadsheet or SQL table, but with some powerful optimizations for working with data in Python. The main features of DataFrame are:

- Columns can be of different data types (e.g., numbers, strings, dates, and lists).
- Row and column labels allow for intuitive indexing.
- Arithmetic operations can be done on rows and columns.
- Missing data is handled gracefully.

The `pandas.DataFrame()` constructor function takes several arguments, some of the important ones are:

- `data`: This argument can be a variety of inputs like ndarray, series, map, lists, dict, constants, and also another DataFrame. It is required to create a DataFrame.
- `index`: It specifies the row labels of the DataFrame. The default value is None, and in that case, the row labels will be auto-generated as simple sequential integers starting from 0.
- `columns`: It specifies the column labels of the DataFrame. The default value is None, and in that case, the column labels will be auto-generated as simple sequential integers starting from 0.

In [13]:
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 32, 18, 47],
        'city':['NYC', 'LA', 'Chicago', 'Boston']}

df = pd.DataFrame(data)
print(df)

      name  age     city
0    Alice   25      NYC
1      Bob   32       LA
2  Charlie   18  Chicago
3    David   47   Boston


In [14]:
# Example of creating a DataFrame from python objects
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}

# Create Pandas series objects from the area and population dictionaries
population = pd.Series(population_dict)
area = pd.Series(area_dict)

# Create a Pandas DataFrame named states with columns for population and area
states = pd.DataFrame({'population': population, 'area': area})

# Print the resulting states DataFrame
print(states)

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. A Pandas DataFrame can be constructed in a variety of ways. Here we’ll give several examples.
- From a single `Series` object. A DataFrame is a collection of `Series` objects, and a singlecolumn DataFrame can be constructed from a single `Series`.
- From a list of dicts. Any list of dictionaries can be made into a DataFrame. We’ll use a simple list comprehension to create some data.
- From a dictionary of `Series` objects. As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well.
- From a two-dimensional NumPy array. Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names.

In [4]:
# Creating a dataframe from a single Series object
pd.DataFrame(population, columns = ['population'])

# Creating a dataframe from a list of dicts
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

# Creating a dataframe from a dictionary of Series objetcs
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
population = pd.Series(population_dict)
area = pd.Series(area_dict)
pd.DataFrame({'population': population, 'area': area})

# Creating a dataframe from a two-dimentional NumPy Array 
pd.DataFrame(np.random.rand(3, 2),
             columns = ['foo', 'bar'],
             index = ['a', 'b', 'c'])

Unnamed: 0,foo,bar
a,0.181832,0.613387
b,0.765804,0.939247
c,0.64859,0.062196


### The Pandas Index Object
We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set.
- The Index object in many ways operates like an array.
- Index objects also have many of the attributes familiar from NumPy arrays
- Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python’s built-in set data structure

In [5]:
# For example, we can use standard Python indexing notation to retrieve values or slices:
ind = pd.Index([2, 3, 5, 7, 11])
print(ind[::2])

# Index objects also have many of the attributes familiar from NumPy arrays
print(ind.size, ind.shape, ind.ndim, ind.dtype)

# Unions, intersections, differences, and other combinations can be computed in a familiar way
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
indA | indB # union
indA ^ indB # symmetric difference

Int64Index([2, 5, 11], dtype='int64')

### Data Indexing and Selection 

#### Data Selection in Series 
A Series object acts in many ways like a onedimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.
- Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values.
- We can also use dictionary-like Python expressions and methods to examine the keys/indices and values
- Series objects can even be modified with a dictionary-like syntax.

In [13]:
# Creating Series object
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)

# Use dictionary-like Python expressions
print('a' in data)
print(data.keys()) 
print(list(data.items())) # Prints out a list of the items 

# Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value
data['e'] = 1.25
print(data)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
True
Index(['a', 'b', 'c', 'd'], dtype='object')
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64


- A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays—that is, slices, masking, and fancy indexing.

In [14]:
# Creating Series object
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)

# slicing by explicit index
data['a':'c']

# slicing by implicit integer index
data[0:2]

# masking
data[(data > 0.3) & (data < 0.8)]

# fancy indexing
data[['a', 'e']]

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


b    0.50
c    0.75
dtype: float64

- Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
    - First, the `loc` attribute allows indexing and slicing that always references the explicit
index.
    - The `iloc` attribute allows indexing and slicing that always references the implicit
Python-style index

In [16]:
# Consider the following Series object
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# To select the element with index label "b"
s.loc["b"]

# To select the element at the second position:
s.iloc[1]

# You can also use slicing notation with loc and iloc. For example, to select all elements from index label "b" to "d
s.loc["b":"d"]
s.iloc[1:4]

b    20
c    30
d    40
dtype: int64

### Read Files with Pandas
You can use the pandas.read_csv() function to read a CSV file with Pandas.

In [21]:
# Read a csv using Pandas
df = pd.read_csv('data/tips.csv')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251


#### Data Selection in DataFrames
Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

- The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name.
- Equivalently, we can use attribute-style access with column names that are strings
- This dictionary-style syntax can also be used to modify the object

In [32]:
# Access via dicitonary style
df['Payment ID']

# Access via attribute-style
df.total_bill

# Add a new column
df['tip_percentage'] = round((df['tip'] / df['total_bill']) * 100, 2)
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person,Payer Name,CC Number,Payment ID,tip_percentage
0,16.99,1.01,Female,No,Sun,Dinner,2,8.49,Christy Cunningham,3560325168603410,Sun2959,5.94
1,10.34,1.66,Male,No,Sun,Dinner,3,3.45,Douglas Tucker,4478071379779230,Sun4608,16.05
2,21.01,3.5,Male,No,Sun,Dinner,3,7.0,Travis Walters,6011812112971322,Sun4458,16.66
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84,Nathaniel Harris,4676137647685994,Sun5260,13.98
4,24.59,3.61,Female,No,Sun,Dinner,4,6.15,Tonya Carter,4832732618637221,Sun2251,14.68


- As `Series`, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
    - First, the `loc` attribute allows indexing and slicing that always references the explicit
index.
    - The `iloc` attribute allows indexing and slicing that always references the implicit
Python-style index

In [37]:
# For example, if you wanted to access the first three rows and first two columns of a DataFrame, you could use iloc like this
print(df.iloc[:3, :2])

# loc is similar to iloc, but it is used for accessing rows and columns of a DataFrame by labe
df.loc[1, "total_bill"]

# For example, in the loc indexer we can combine masking and fancy indexing as in the following
df.loc[df.total_bill > 20, ['total_bill', 'tip', 'sex']]

   total_bill   tip
0       16.99  1.01
1       10.34  1.66
2       21.01  3.50


Unnamed: 0,total_bill,tip,sex
2,21.01,3.50,Male
3,23.68,3.31,Male
4,24.59,3.61,Female
5,25.29,4.71,Male
7,26.88,3.12,Male
...,...,...,...
237,32.83,1.17,Male
238,35.83,4.67,Female
239,29.03,5.92,Male
240,27.18,2.00,Female


There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. 
- First, while indexing refers to columns, slicing refers to rows
- Such slices can also refer to rows by number rather than by index
- Similarly, direct masking operations are also interpreted row-wise rather than column-wise

In [40]:
# Create a new dataframe
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

# Slicing rows by index
data['Florida':'Illinois']

# Slicing rows by number
data[1:3]

# Masking operations 
data[data.density > 100]

Unnamed: 0,area,pop
Florida,170312,19552860
Illinois,149995,12882135
