# Data Analytics

![Python and Pandas!](images/PythonAndPandas.png)

## We are going to learn about ...

- What is Pandas
- Pandas & NumPy
- Pandas and Jupyter Notebooks
- What Pandas can do
- Pandas Series
- Pandas DataFrames
    - Viewing 
    - Adding to a DataFrame
    - Sorting DataFrames

<br>

---


## What is Pandas

- An open-source Python package that is most widely used for data science/data analysis and machine learning tasks. 
- Built on top of NumPy which provides support for multi-dimensional arrays.
- The name Pandas references both “Panel Data” (an econometrics term for multi-dimensional data sets) and “Python Data Analysis”
- Created by Wes McKinney in 2008
- Official documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide
- Community tutorials: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html

## Pandas & NumPy

- NumPy is a library that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays
- Pandas is a high-level data manipulation tool that is built on the NumPy package
- Pandas offers an in-memory 2d table object called a DataFrame
- A DataFrame is structured like a table or spreadsheet -- with rows and columns
- There are a few functions that exist in NumPy that we use specifically on Pandas DataFrames
- Just as the "ndarray" is the foundation of NumPy, the "Series" is the core object of Pandas
- For purely numerical work, NumPy arrays generally use less memory and run faster than Pandas objects, because they carry no index or column labels
- Together, these two libraries are among the most widely used for data science applications
- Pandas mainly works with tabular data, whereas NumPy works with numerical data
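
To make the relationship between the two libraries concrete, here is a minimal sketch. The Series name `prices` and its values are made up for illustration. It shows that a numeric Series exposes its data as a NumPy array, and that a NumPy function applied to a Series keeps the labels.

```python
import numpy as np
import pandas as pd

# A numeric Series stores its values in a NumPy array and adds an index of labels
prices = pd.Series([4.0, 9.0, 16.0], index=["a", "b", "c"])

# to_numpy() hands back the plain ndarray, without the labels
values = prices.to_numpy()
print(type(values))

# NumPy functions can be applied to a Series; the result is still a labeled Series
roots = np.sqrt(prices)
print(roots)
```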

## Pandas & Jupyter Notebooks

Jupyter Notebooks offer a good environment for using pandas to do data exploration and modeling, but pandas can also be used in text editors just as easily.

Jupyter Notebooks give us the ability to execute code in a particular cell as opposed to running the entire file. This saves a lot of time when working with large datasets and complex transformations. 

Notebooks also provide an easy way to visualize pandas’ DataFrames and plots.


## What can Pandas do?

Pandas supports the five main steps of processing and analyzing data, regardless of where the data comes from: load, manipulate, prepare, model, and analyze.

What’s cool about Pandas is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a DataFrame, which looks very similar to a table in spreadsheet or statistical software (think Excel).
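
As a rough illustration of that loading step, the sketch below reads CSV text into a DataFrame. To stay self-contained it uses an in-memory string (`csv_text`, with made-up names and scores) in place of a real file. Passing a file path to `pd.read_csv()` works the same way.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would usually be a file on disk
csv_text = """name,score
Ana,90
Ben,85
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)  # (rows, columns)
```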

These capabilities are why Pandas is widely regarded as one of the best Python tools for data analysis and manipulation.

### Pandas can do ...

|    |    |
|----|----|
| Data Cleansing | Data fill |
| Data normalization | Merges and joins |
| Data visualization | Statistical analysis |
| Data inspection | Loading and saving data |

<br>

## Installing and Using Pandas

**Remember: Pandas is a Module.**

You have to install it first. NumPy is required, and `pip` installs it automatically as a dependency:

```bash
    pip install pandas
```

Then you have to import it at the beginning of every code file to use it:

```python
    import pandas as pd
```

<br>

---

## DataFrames & Series

A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.

DataFrames and Series are quite similar in that many operations you can do with one you can also do with the other, such as filling in null values and calculating the mean.

![Pandas Series and DataFrames](images/Pandas_series-and-dataframe.png)
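
The sketch below shows that overlap with two small, made-up objects (`scores` and `table`). It calls `fillna()` and `mean()` on each, and the only real difference is the shape of the result.

```python
import numpy as np
import pandas as pd

# Illustrative data with a few missing values
scores = pd.Series([10.0, np.nan, 30.0], name="scores")
table = pd.DataFrame({
    "scores": [10.0, np.nan, 30.0],
    "bonus": [1.0, 2.0, np.nan],
})

# fillna() works the same way on both objects
series_filled = scores.fillna(0)
table_filled = table.fillna(0)

# mean() skips missing values by default
series_mean = scores.mean()  # a single number
table_mean = table.mean()    # a Series with one mean per column

print(series_mean)
print(table_mean)
```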

### Pandas Series

- A Pandas Series is like a column in a table
- It is a one-dimensional array holding data of any type
- If nothing else is specified, the values of the series are labeled with their index numbers -- first value has index 0, second value has index 1 etc.
- These labels can be used to access specified values in the series
- With the index argument, you can name your own labels for the indexes of your series
- When you have created labels, you can access an item by referring to the label
- You can also use a key/value object, like a dictionary, when creating a Series
- You can create a DataFrame from two Series
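
The points above can be sketched as follows. The `calories` data and the `day1` to `day3` labels are invented for illustration.

```python
import pandas as pd

# Custom labels via the index argument
calories = pd.Series([420, 380, 390], index=["day1", "day2", "day3"])

# Access an item by its label
print(calories["day2"])

# A dictionary works too: keys become the labels
from_dict = pd.Series({"day1": 420, "day2": 380, "day3": 390})
print(from_dict)

# With a dictionary, the index argument selects which keys to keep
subset = pd.Series({"day1": 420, "day2": 380, "day3": 390}, index=["day1", "day3"])
print(subset)
```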

### Pandas DataFrames

- A Pandas DataFrame is a 2D data structure, like a 2 dimensional array, or a table with rows and columns
- Pandas uses the `loc` attribute to return one or more specified row(s)
- With the index argument, you can name your own indexes
- Use the named index in the `loc` attribute to return the specified row(s)
- If your data sets are stored in a file, Pandas can load them into a DataFrame



---
### Working with Series...

In pandas, a Series is a one-dimensional, labeled array, capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). 

A Series stores data in sequential order. It holds one column of information, similar to a column in an Excel sheet or SQL table.

#### Series Example

Create a new Python file -- make sure it has a `.py` extension

Type this code into your Python file, open the terminal, and run it...

```python
    import pandas as pd
    age = [20, 40, 60]
    years = pd.Series(age)
    print(years)

```
> To run it in your terminal type ...

```bash
    python filename.py
```

> Note: For the macOS crowd, the command needs to be **`python3 ...`**

In [None]:
import pandas as pd

age = [20, 40, 60]
years = pd.Series(age)
print(years)

**Finding the location of a row in a series ...**

In [None]:
# find a row in the Series using the index value
import pandas as pd

age = [20, 40, 60]
years = pd.Series(age)

# index the Series (years), not the original list (age)
print(years[0])

### How to Combine Two Series into a Pandas DataFrame

Using the `pandas.concat()`, `Series.append()`, `pandas.merge()`, or `DataFrame.join()` methods, you can combine / merge two or more Series into a DataFrame.

#### 1.) Combine Two Series Using Pandas `.concat()`


In [None]:
# you can combine multiple series along a particular axis (column-wise or row-wise)
import pandas as pd

# Create pandas Series
courses = pd.Series(["Spark","PySpark","Hadoop"])
fees = pd.Series([22000,25000,23000])
discount  = pd.Series([1000,2300,1000])

# Combine two Series
df = pd.concat([courses, fees], axis=1)
print("Concat 2 Series ...\n", df)

# Combine multiple Series
df = pd.concat([courses, fees, discount], axis=1)
print("\nConcat 3 Series ...\n", df)


> Note that if a Series doesn’t have a name, and no column names are provided while merging, default numbers are assigned as the column names.

In [None]:
# Create Series by assigning names to each column
import pandas as pd

courses = pd.Series(["Spark","PySpark","Hadoop"], name='courses')
fees = pd.Series([22000,25000,23000], name='fees')
discount  = pd.Series([1000,2300,1000],name='discount')

df = pd.concat([courses,fees,discount],axis=1)
print(df)

If you add custom indexes to a Series, the `concat()` method carries the same indexes over to the created DataFrame.

In [None]:
# Create Series with assigned indexes and provide custom column names to each
import pandas as pd

courses = pd.Series(["Spark","PySpark","Hadoop"], name='courses')
fees = pd.Series([22000,25000,23000], name='fees')
discount  = pd.Series([1000,2300,1000],name='discount')

# Assign Index to Series
index_labels=['r1','r2','r3']
courses.index = index_labels
fees.index = index_labels
discount.index = index_labels

# Concat Series by Changing Names
df = pd.concat({'Courses': courses,
                'Course_Fee': fees,
                'Course_Discount': discount},axis=1)
print(df)

Finally, let's see how to reset the indexes using the `reset_index()` method. 

This moves the current index into a column and adds a new default index to the combined DataFrame.

In [None]:
# Create Series with assigned indexes and provide custom column names to each
import pandas as pd

courses = pd.Series(["Spark","PySpark","Hadoop"], name='courses')
fees = pd.Series([22000,25000,23000], name='fees')
discount  = pd.Series([1000,2300,1000],name='discount')

# Assign Index to Series
index_labels = ['r1','r2','r3']
courses.index = index_labels
fees.index = index_labels
discount.index = index_labels

# Concat Series by Changing Names
df=pd.concat({'Courses': courses,
                'Course_Fee': fees,
                'Course_Discount': discount},axis=1)

#change the index to a column & create new index
df = df.reset_index()

print(df)

#### 2.) Combine Two Series Using `pandas.merge()`

The Pandas `merge()` method combines DataFrames column-wise, in the same way as SQL joins. 

Pandas `merge()` can be used for all database join operations between DataFrame or named Series objects. In this case, each Series must have a name, which you set by passing the extra `name` parameter when creating it.

Syntax: `pd.merge(S1, S2, right_index=True, left_index=True)`

In [None]:
# Create Series by assigning names
import pandas as pd

courses = pd.Series(["Spark","PySpark","Hadoop"], name='courses')
fees = pd.Series([22000,25000,23000], name='fees')

# using pandas series merge()
df = pd.merge(courses, fees, right_index = True,
                left_index = True)
print(df)
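
Since `merge()` follows SQL join semantics, the `how` parameter controls which rows survive when the indexes do not line up. The sketch below assumes two Series whose made-up indexes (`r1` to `r4`) only partly overlap, and compares the default inner join with an outer join.

```python
import pandas as pd

# Illustrative Series: only 'r2' and 'r3' appear in both indexes
courses = pd.Series(["Spark", "PySpark", "Hadoop"],
                    index=["r1", "r2", "r3"], name="courses")
fees = pd.Series([25000, 23000, 21000],
                 index=["r2", "r3", "r4"], name="fees")

# Default how="inner": keep only labels present in both Series
inner = pd.merge(courses, fees, left_index=True, right_index=True)
print(inner)

# how="outer": keep every label and fill the gaps with NaN
outer = pd.merge(courses, fees, left_index=True, right_index=True, how="outer")
print(outer)
```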

#### 3.) Using `Series.append()` to Combine Two Series

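In pandas versions before 2.0, you can use `pandas.DataFrame(Series.append(Series, ignore_index=True))` to create a DataFrame by appending one Series to another. `Series.append()` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the cell below raises an `AttributeError`.

Note that this example doesn’t create multiple columns; it just appends the values as additional rows.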

In [None]:
# Using Series.append()
import pandas as pd

courses_am = pd.Series(["Spark","PySpark","Hadoop"])
courses_pm = pd.Series(["Pandas","Python","Scala"])

df = pd.DataFrame(courses_am.append(courses_pm,
                                    ignore_index = True),
                    columns=['all_courses'])
print(df)


### Warning: Libraries and languages change over time!

On pandas 1.4 through 1.5, the example above produces this warning:

![FutureWarning for a deprecated method](images/futurewarning_depricated_method.png)

It tells us not only that the `append` method we are using is marked for removal, but also which method to use instead. 

If this is the first time you have seen this error _(warning)_ then you will likely need to review the documentation to learn the features of the replacement method and its syntax. 

pandas.concat - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html?highlight=concat#pandas.concat

Let's rewrite our code and give it a test!

In [None]:
# Using concat()
import pandas as pd

courses_am = pd.Series(["Spark","PySpark","Hadoop"])
courses_pm = pd.Series(["Pandas","Python","Scala"])

df = pd.DataFrame(pd.concat([courses_am,courses_pm], ignore_index=True))

print(df)
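
The rewritten cell above no longer sets the `all_courses` column name that the `append` version had. One possible way to restore it, sketched here as an option and not the only approach, is to build the combined Series first and then convert it with `Series.to_frame()`.

```python
import pandas as pd

courses_am = pd.Series(["Spark", "PySpark", "Hadoop"])
courses_pm = pd.Series(["Pandas", "Python", "Scala"])

# Stack the two Series into one, renumbering the index from 0
all_courses = pd.concat([courses_am, courses_pm], ignore_index=True)

# Convert to a one-column DataFrame and name that column
df = all_courses.to_frame(name="all_courses")
print(df)
```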

#### 4.) Combine Two Series Using `DataFrame.join()`

You can also use `DataFrame.join()` to join two Series. 

To use `DataFrame.join()`, you need a DataFrame object. One way to get one is to create a DataFrame from a Series, and then use that DataFrame to join with another Series.

In [None]:
# create Series with assigning names
import pandas as pd

courses = pd.Series(["Spark","PySpark","Hadoop"], name='courses')
fees = pd.Series([22000,25000,23000], name='fees')

# Using DataFrame.join()
df = pd.DataFrame(courses).join(fees)

print(df)


---
### Working with DataFrames ...

There are many ways to create a DataFrame from scratch, but a great option is to just use a simple `dict`, and then pass it to the Pandas DataFrame constructor.

Each (key, value) item in the dictionary will correspond to a column in the resulting DataFrame.

The Index of this DataFrame is given by default on creation, but we could also create our own when we initialize the DataFrame.

In [None]:
import pandas as pd

# create a dictionary
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

# pass the dict to the Pandas DataFrame constructor
purchases = pd.DataFrame(data)
print("Purchases DataFrame ...\n", purchases)

# we could create our own indexes when we initialize the DataFrame
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

print("\nPurchases w/ customer indexes ...\n", purchases)


#### DataFrame Example

Create a new Jupyter Notebook file -- make sure it has a `.ipynb` extension. You could run Jupyter inside VS Code or in the Jupyter Notebook console in the browser.

In [None]:
# working with Pandas DataFrames
import pandas as pd

Report = {
    "Classes": ["Math", "Science", "Spanish", "History", "Health"],
    "Grades": [75, 80, 95, 60, 100]
    }

results = pd.DataFrame(Report)
print(results)


**Finding the location of a row in a DataFrame...**

In [None]:
# find the location of a row
import pandas as pd

Report = {
    "Classes": ["Math", "Science", "Spanish", "History", "Health"],
    "Grades": [75, 80, 95, 60, 100]
    }

results = pd.DataFrame(Report)
print(results.loc[3])
type(results.loc[3])


> Note: The example above returns a Pandas Series.

In [None]:
# find the Location of More than 1 row
import pandas as pd

Report = {
    "Classes": ["Math", "Science", "Spanish", "History", "Health"],
    "Grades": [75, 80, 95, 60, 100]
    }

results = pd.DataFrame(Report)
print(results.loc[[2, 3]])
type(results.loc[[2, 3]])


> Note: When using `[[ ]]` above, the result is a Pandas DataFrame.

In [None]:
# naming the rows / indexes
import pandas as pd

Report = {
    "Classes": ["Math", "Science", "Spanish", "History", "Health"],
    "Grades": [75, 80, 95, 60, 100]
    }

results = pd.DataFrame(Report, index = ["week1", "week2", "week3", "week4", "week5"])
print(results)


In [None]:
# Locating a specific row using the named indexes
import pandas as pd

Report = {
    "Classes": ["Math", "Science", "Spanish", "History", "Health"],
    "Grades": [75, 80, 95, 60, 100]
    }

results = pd.DataFrame(Report, index = ["week1", "week2", "week3", "week4", "week5"])
print(results.loc["week3"])
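
Beyond a single row, `loc` also accepts a row label together with a column label, a list of labels, or a slice of labels. The sketch below reuses the report data from the cells above. Note that a label slice includes both endpoints, unlike a normal Python slice.

```python
import pandas as pd

report = {
    "Classes": ["Math", "Science", "Spanish", "History", "Health"],
    "Grades": [75, 80, 95, 60, 100],
}
results = pd.DataFrame(report, index=["week1", "week2", "week3", "week4", "week5"])

# Row label plus column label returns a single value
grade = results.loc["week3", "Grades"]
print(grade)

# A list of labels returns a DataFrame with just those rows
two_weeks = results.loc[["week1", "week5"]]
print(two_weeks)

# A label slice includes BOTH endpoints (week2, week3 and week4)
span = results.loc["week2":"week4"]
print(span)
```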


#### Sorting DataFrames

We'll look at different methods to sort a DataFrame

- Sorting in Ascending order
- Sorting in Descending order
- Sorting by putting missing values first
- Sorting by multiple columns


In [None]:
# importing Pandas library
import pandas as pd

# creating and initializing a nested list
age_list = [['Afghanistan', 1952, 8425333, 'Asia'],
            ['Australia', 1957, 9712569, 'Oceania'],
            ['Brazil', 1962, 76039390, 'Americas'],
            ['China', 1957, 637408000, 'Asia'],
            ['France', 1957, 44310863, 'Europe'],
            ['India', 1952, 3.72e+08, 'Asia'],
            ['South Africa', 1966, 0.0, 'Africa'],
            ['United States', 1957, 171984000, 'Americas']]

# creating a Pandas DataFrame (pandas is already imported above)

df = pd.DataFrame(age_list, columns=['Country', 'Year',
                                    'Population', 'Continent'])
print("Original DataFrame ...\n", df)

#### ASCENDING EXAMPLE ####
# Sorting the DataFrame in Ascending order -- Sorting by column 'Continent'
df.sort_values(by = ['Continent'], inplace = True)
#print("\nDF sorted by Continent ...\n", df)


#### DESCENDING EXAMPLE ####
# Sorting the DataFrame in Descending order -- Sorting by column "Country"
df.sort_values(by = ['Country'], inplace = True, ascending = False)
# print("\nDF sorted by Country descending ...\n", df)


#### MISSING VALUES EXAMPLE ####
# Sorting column "Population" by putting missing values (NaN) first
# Note: this sample has no NaN values (0.0 is a number), so the result is a plain ascending sort
df.sort_values(by = ['Population'], inplace = True, na_position = 'first')
# print("\nDF sorted by missing values first ...\n", df)


#### MULTI COLUMN SORT EXAMPLE ####
# Sorting by multiple columns -- "Continent" and then "Country"
df.sort_values(by = ['Continent', 'Country'], inplace = True)
# print("\nDF sorting multiple columns ...\n", df)


#### EXAMPLE SORT MULTI COLUMNS IN DIFFERENT ORDER ####
# Sorting Data frames by multiple columns but different order
# Sorting "Country" descending, and "Continent" ascending
df.sort_values(by = ['Country', 'Continent'],
                ascending = [False, True], inplace = True)
print("\nDF sorting multiple columns in different order ...\n", df)
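
The sorting cell above uses data with no missing values, so `na_position` has no visible effect there. The sketch below uses a small, made-up table (`team` and `points` are hypothetical columns) with one `NaN` to show the difference. It also leaves out `inplace=True`, so each call returns a new sorted DataFrame and the original stays untouched.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value
df = pd.DataFrame({
    "team": ["A", "B", "C", "D"],
    "points": [12.0, np.nan, 7.0, 30.0],
})

# Missing values first, then ascending order
nan_first = df.sort_values(by="points", na_position="first")
print(nan_first)

# Default behaviour: missing values go last
nan_last = df.sort_values(by="points")
print(nan_last)

# Without inplace=True the original DataFrame keeps its order
print(df)
```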


#### Fun Facts regarding US 2020 census:

https://usafacts.org/state-of-the-union/population/?msclkid=fe650f642143182d43d06c190597ba46

#### Adding new column to existing DataFrame in Pandas

We'll look at different methods to add a new column to a DataFrame
- By declaring a new list as a column
- By using `DataFrame.insert()`
- By using `DataFrame.assign()`
- By using a dictionary


In [None]:
# Import pandas package
import pandas as pd

# Define a dictionary containing Students data
data = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
        'Height': [5.1, 6.2, 5.1, 5.2],
        'Qualification': ['Msc', 'MA', 'Msc', 'Msc']}

# Convert the dictionary into DataFrame
df = pd.DataFrame(data)
print("Original DataFrame ...\n", df)


#### LIST AS COLUMN EXAMPLE ####
# Declare a list that is to be converted into a column
address = ['Delhi', 'Bangalore', 'Chennai', 'Patna']
# Using 'Address' as the column name and equating it to the list
df['Address'] = address

# print("\nDF with column from list ...\n", df)


#### INSERT EXAMPLE ####
# Using DataFrame.insert() to add a column
df.insert(2, "Age", [21, 23, 24, 21], allow_duplicates=True)
# print("\nDF with insert as column 2 ...\n", df)


#### ASSIGN EXAMPLE ####
# Using 'Pets' as the column name and assigning the list to it
df = df.assign(Pets=['Dog', 'Bunny', 'Chinchilla', 'Parrot'])

# print("\nDF with assigned column added ...\n", df)


#### DICTIONARY EXAMPLE ####
# Define a dictionary with keys of an existing column
# and their respective values as the values for our new column
# If a primary key is defined use that key
sport = {'Jai': 'Darts', 'Princi': 'Basketball',
                'Gaurav': 'PaddleBoarding', 'Anuj': 'Cricket'}

# Provide 'Sport' as the new column name and map it to the key column
df['Sport'] = df['Name'].map(sport)
print("\nDF with new column from dictionary ...\n", df)
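
One detail of the dictionary approach is worth seeing on its own: when a value in the key column has no entry in the dictionary, `map()` fills that row with `NaN` and does not raise an error. A minimal sketch, using a shortened copy of the names above and a dictionary that deliberately omits one of them:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jai", "Princi", "Gaurav"]})

# This dictionary has no entry for 'Gaurav'
sport = {"Jai": "Darts", "Princi": "Basketball"}

# Names missing from the dictionary map to NaN
df["Sport"] = df["Name"].map(sport)
print(df)

# Count how many rows were left without a match
print(df["Sport"].isna().sum())
```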


#### View & Print a Summary of a DataFrame

Data visualization is the technique used to deliver insights in data using visual cues such as graphs, charts, maps, and many others. 

This is useful as it permits intuitive and easy understanding of large quantities of data to facilitate making better business decisions. 

When we use a standard print in Pandas, like `print(df)`, the complete DataFrame is not printed if its length exceeds the default limit (`pd.options.display.max_rows`); the output is truncated.

In that case, you get the first 5 rows and the last 5 rows, along with the row and column count of the DataFrame.


In [None]:
# example of printing a Pandas DataFrame
import pandas as pd

df = pd.read_csv('./resources/data.csv')

# print first 5 & last 5 lines With the row and column count
print(df)

In [None]:
# to_string() renders the whole DataFrame as one string
import pandas as pd

df = pd.read_csv('./resources/data.csv')

# note: for a large file, this output may exceed the notebook's display size limit
print(df.to_string())

In [None]:
# check the max rows setting on your system with the display.max_rows option
import pandas as pd

df = pd.read_csv('./resources/data.csv')

print(pd.options.display.max_rows)

In [None]:
# change the Max Rows setting on your system
import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('./resources/data.csv')
print(df)
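
Setting `pd.options.display.max_rows` changes the behaviour for the rest of the session. If you only want a different limit for one print, `pd.option_context()` applies the setting temporarily. The sketch below uses a generated 100-row DataFrame in place of the course's CSV file so it can run on its own.

```python
import pandas as pd

# Stand-in data: 100 rows, more than the usual display limit
df = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# Inside the with-block, the higher limit applies and all rows are rendered
with pd.option_context("display.max_rows", 200):
    inside = pd.get_option("display.max_rows")
    full_text = repr(df)

# With a low limit, the same DataFrame is rendered truncated
with pd.option_context("display.max_rows", 10):
    short_text = repr(df)

# Outside the blocks, the original setting is back
after = pd.get_option("display.max_rows")

print(short_text)
print(before, inside, after)
```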

#### Viewing the First or Last `N` rows of a DataFrame

View the first or last `N` rows of a DataFrame using the `.head()` or `.tail()` methods. If you call `.head()` or `.tail()` with no argument, you get only the first or last five (5) rows.

- `print(df.head(10))`
- `print(df.tail(10))`

In [None]:
# Viewing the FIRST or LAST 12 rows
import pandas as pd

df = pd.read_csv('./resources/data.csv')

# view the head
print(df.head(12))

# view the tail
print(df.tail(12))
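
To try `.head()` and `.tail()` without the CSV file, the sketch below builds a 20-row DataFrame of made-up numbers and checks both the default of five rows and an explicit count.

```python
import pandas as pd

# Stand-in data: the numbers 1 through 20
df = pd.DataFrame({"n": range(1, 21)})

# No argument: the first five rows
first_default = df.head()
print(first_default)

# Explicit count: the last three rows
last_three = df.tail(3)
print(last_three)
```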

#### Viewing information about the DataFrame

You can view a summary of your DataFrame using the `.info()` method. Calling it with no arguments prints the full DataFrame summary.

The summary contains the index range, the number of columns, the column labels, the column data types, the number of non-null values in each column, and the memory usage. 

> Note: the `.info()` method prints the summary itself; you do not need to wrap it in `print()`.


In [None]:
import pandas as pd

df = pd.read_csv('./resources/data.csv')
df.info()
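
Because `.info()` prints its summary and returns `None`, you cannot store the summary by assigning the result of the call. If you need the text itself (for a log file, say), one option is the `buf` parameter, sketched here with a small made-up DataFrame in place of the CSV file.

```python
import io
import pandas as pd

# Stand-in data with a couple of missing values
df = pd.DataFrame({
    "name": ["Ana", "Ben", None],
    "score": [90.0, None, 70.0],
})

# info() prints the summary and returns None
result = df.info()
print(result)

# Passing a buffer captures the summary as text instead of printing it
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```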

#### **Below is a detailed explanation of the DataFrame `.info()` display**

![Pandas DataFrame Info Display Explained](images/Pandas_DF_InfoDisplay.png)

<br>

In the next section we'll look at Data Ingestion with Pandas, particularly Pandas and .CSV files