# Pandas Lesson Plan and Hacks
> The outline for our lesson plan and hacks
- toc: true
- categories: []
- type: ap
- week: 29

# Predictive Analysis
Predictive analysis is the use of **statistical**, data mining, and machine learning techniques to analyze current and historical data in order to make predictions about future events or behaviors. It involves identifying **patterns** and trends in data, and then using that information to forecast what is likely to happen in the future.

Predictive analysis is used in a wide range of applications, from forecasting sales and demand, to predicting customer behavior, to detecting fraudulent transactions. It involves collecting and analyzing data from a variety of sources, including historical data, customer data, financial data, and social media data, among others.

The process of predictive analysis typically involves the following steps:
1. Defining the problem and identifying the relevant data sources
2. **Collecting and cleaning the data**
3. Exploring and analyzing the data to identify patterns and trends
4. Selecting an appropriate model or algorithm to use for predictions
5. Training and validating the model using historical data
6. **Using the model to make predictions on new data**
7. Monitoring and evaluating the performance of the model over time
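
The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `sales` table and its values are made up, and a straight-line fit from `numpy.polyfit` stands in for whatever model a real project would choose in step 4.

```python
import numpy as np
import pandas as pd

# Steps 1-2: collect and clean the data
# (hypothetical monthly sales; one value is missing)
sales = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "units": [100.0, 110.0, None, 130.0, 140.0, 150.0],
})
clean = sales.dropna()  # drop the row with the missing value

# Step 3: explore the data, here by checking how strongly the columns move together
print("correlation:", clean["month"].corr(clean["units"]))

# Steps 4-5: choose a simple model (a straight line), train it on the
# earlier months, and validate it on the most recent month
train, holdout = clean.iloc[:-1], clean.iloc[-1:]
slope, intercept = np.polyfit(train["month"], train["units"], deg=1)
predicted_holdout = slope * holdout["month"].iloc[0] + intercept
validation_error = abs(predicted_holdout - holdout["units"].iloc[0])
print("validation error:", validation_error)

# Step 6: use the model to predict a month that has not happened yet
forecast = slope * 7 + intercept
print("forecast for month 7:", forecast)
```

Step 7 would repeat the validation check as new months arrive, to see whether the error grows over time.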

Predictive analysis can help organizations make more informed decisions, improve efficiency, and gain a competitive advantage by leveraging insights from data.

Two common application areas are **retail** and **healthcare**. In retail, sellers try to predict which products will be most popular and promote those products accordingly. In healthcare, algorithms analyze patterns to identify risk factors for diseases and suggest preventive treatment, predict the outcomes of different treatments to help choose the best option for each patient, and forecast disease outbreaks and epidemics.

# Pandas
## What is Pandas
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

## Why Use Pandas?
Pandas lets us analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

## What Can Pandas Do?
Pandas can answer questions about the data, such as:
- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?
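
As a rough sketch, each of these questions maps to a single Pandas call. The `data` table below is hypothetical and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: values are made up for illustration
data = pd.DataFrame({
    "passings": [3, 7, 2],
    "speed": [30, 70, 20],
})

# Is there a correlation between two columns? (1.0 means perfectly correlated)
correlation = data["passings"].corr(data["speed"])

# What are the average, maximum, and minimum values of a column?
average = data["passings"].mean()
highest = data["passings"].max()
lowest = data["passings"].min()

print(correlation, average, highest, lowest)
```

Calling `data.describe()` reports several of these statistics for every numeric column at once.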

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
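
A minimal sketch of cleaning, assuming a made-up table named `raw` in which some cells are empty. `dropna()` removes every row that has at least one missing value, and the `subset` argument limits the check to chosen columns.

```python
import pandas as pd

# Hypothetical messy data: None marks an empty cell
raw = pd.DataFrame({
    "cars": ["BMW", "Volvo", None, "Ford"],
    "passings": [3, None, 5, 2],
})

# Drop every row that has a missing value in any column
cleaned = raw.dropna()

# Drop only the rows where 'passings' is missing
has_passings = raw.dropna(subset=["passings"])

print(cleaned)
print(has_passings)
```

Neither call changes `raw` itself; each returns a new DataFrame.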

In [6]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)


    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
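
The next cells use a DataFrame named `df` with `Item ID` and `Price` columns, which this lesson does not define. A minimal sketch that would let them run, with values made up purely for illustration:

```python
import pandas as pd

# Made-up values for illustration only; the lesson's real data is not shown
df = pd.DataFrame({
    "Item ID": [103, 101, 102],
    "Price": [1.99, 0.99, 1.49],
})

print(df)
```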


In [None]:
#assumes df is a DataFrame that already has 'Item ID' and 'Price' columns
#print the values in the 'Item ID' column with the column header
print(df[['Item ID']])

print()

#print two columns and omit the index from the output
print(df[['Price','Item ID']].to_string(index=False))

In [None]:
#sort values
print(df.sort_values(by=['Item ID']))

print()

#sort the values in reverse order
print(df.sort_values(by=['Item ID'], ascending=False))

In [None]:
#print only the rows that meet a specific criterion (Price above 1.49)
print(df[df.Price > 1.49])

In [None]:
#print the rows with the highest Price, then the rows with the lowest Price
print(df[df.Price == df.Price.max()])
print()
print(df[df.Price == df.Price.min()])

# Hacks
> Answer the questions below and write code for the last question. Some questions may require individual research.

- What are the key data structures used in Pandas?
    - ?
- How can you merge, join, and concatenate data frames in Pandas?
    - ?
- How can you optimize the performance of Pandas operations on large datasets?
    - ?
- Find a certain scenario, topic, or trend which has quantitative data. Store the data in a file and read it in Pandas. Show some analysis of the data.

In [5]:
import pandas as pd

#read csv and sort by 'JerseyNumber', largest to smallest
df = pd.read_csv('files/padres.csv').sort_values(by=['JerseyNumber'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))
#iterating over a DataFrame yields its column labels, so this prints the column names
print(', '.join(df.tail(10)))

--Duration Top 10---------
               Name          Position  Average  HR  RBI    OPS  JerseyNumber
4        NelsonCruz  DesignatedHitter    0.234  10   64  0.651            32
11       JoseAzocar          Outfield    0.257   0   10  0.630            28
10       AustinNola           Catcher    0.251   4   40  0.649            26
1   FernandoTatisJr        RightField    0.281  42   97  0.975            23
2          JuanSoto         LeftField    0.242  27   62  0.853            22
5     MattCarpenter  DesignatedHitter    0.305  15   37  1.138            14
0      MannyMachado         ThirdBase    0.298  32  102  0.897            13
9     LuisCampusano           Catcher    0.250   1    5  0.593            12
6   JakeCronenworth         FirstBase    0.239  17   88  0.722             9
7       Ha-SeongKim        SecondBase    0.251  11   59  0.708             7
--Duration Bottom 10------
               Name          Position  Average  HR  RBI    OPS  JerseyNumber
10       AustinNola   