# Pandas Lesson Plan and Hacks
> The outline for our lesson plan and hacks
- toc: true
- categories: []
- type: ap
- week: 29

# Predictive Analysis
Predictive analysis is the use of **statistical**, data mining, and machine learning techniques to analyze current and historical data in order to make predictions about future events or behaviors. It involves identifying **patterns** and trends in data, and then using that information to forecast what is likely to happen in the future.

Predictive analysis is used in a wide range of applications, from forecasting sales and demand, to predicting customer behavior, to detecting fraudulent transactions. It involves collecting and analyzing data from a variety of sources, including historical data, customer data, financial data, and social media data, among others.

The process of predictive analysis typically involves the following steps:
1. Defining the problem and identifying the relevant data sources
2. **Collecting and cleaning the data**
3. Exploring and analyzing the data to identify patterns and trends
4. Selecting an appropriate model or algorithm to use for predictions
5. Training and validating the model using historical data
6. **Using the model to make predictions on new data**
7. Monitoring and evaluating the performance of the model over time
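
The steps above can be sketched on a very small scale with pandas and NumPy. This is only an illustration under stated assumptions: the sales numbers are made up, and a straight-line fit with `np.polyfit` stands in for whatever model a real project would choose.

```python
import numpy as np
import pandas as pd

# Steps 1-2: hypothetical monthly sales data with one missing value to clean.
sales = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "units": [100.0, 110.0, None, 130.0, 140.0, 150.0],
})
clean = sales.dropna()  # remove the row with the missing value

# Step 3: explore the data with a quick statistical summary.
print(clean["units"].describe())

# Steps 4-5: fit a straight line, units = slope * month + intercept.
slope, intercept = np.polyfit(clean["month"], clean["units"], deg=1)

# Step 6: use the fitted line to predict units for a new month.
predicted = slope * 7 + intercept
print(round(predicted, 1))
```

Because the made-up data grows by exactly 10 units per month, the forecast for month 7 comes out at about 160. Real data is rarely this tidy, which is why step 7 (monitoring the model) matters.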

Predictive analysis can help organizations make more informed decisions, improve efficiency, and gain a competitive advantage by leveraging insights from data.

Two common areas of use are **retail**, where companies try to predict which products will be most popular and then advertise those products more heavily, and **healthcare**, where algorithms analyze patterns to reveal risk factors for diseases and suggest preventive treatment, predict the results of various treatments to choose the best option for each patient, and predict disease outbreaks and epidemics.

# Pandas
### What is Pandas?
Pandas is a Python library used for working with data sets. A Python library is a collection of prewritten code that you can import and reuse in your own programs. Pandas has functions for analyzing, cleaning, exploring, and manipulating data.

### Why Use Pandas?
Pandas allows us to analyze large data sets and draw conclusions based on statistical theories. Pandas can also clean messy data sets to make them readable and relevant, which makes it a core tool for data analysis and data manipulation.

### What Can Pandas Do?
Pandas gives you answers about the data, such as:
- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

It also lets you:
- Load data
- Delete data
- Sort data
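
As a rough sketch of how those questions map onto pandas calls, the example below uses a small made-up table of study hours and test scores (the data and column names are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical data: hours studied and test scores for five students.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 60, 71, 80, 88],
})

print(df["hours"].corr(df["score"]))  # correlation between two columns
print(df["score"].mean())             # average value
print(df["score"].max())              # max value
print(df["score"].min())              # min value

# Sort rows by score, highest first.
print(df.sort_values(by="score", ascending=False))

# Delete a column that is no longer needed (returns a new DataFrame).
print(df.drop(columns=["hours"]))
```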

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
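
A minimal cleaning sketch, assuming a made-up table where one age is missing and one is impossible. The rule "age must be greater than 0" is our own choice for this example, not something pandas decides for you.

```python
import pandas as pd

# Hypothetical data with one missing value and one impossible value.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "age": [16, None, 17, -5],
})

# Drop rows that contain empty (NaN) values.
no_missing = df.dropna()

# Keep only rows whose age makes sense (a rule chosen for this example).
cleaned = no_missing[no_missing["age"] > 0]

print(cleaned)
```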

# Basics of Pandas

In [None]:
import pandas as pd
# This imports the pandas library under the short name pd. You need this line in any code that uses pandas.

#### DICTIONARIES AND DATASETS
- One way to create a pandas data set is to build a dictionary and pass it to pd.DataFrame, as seen below with the dictionary data1. pd.DataFrame turns the dictionary into a table (a DataFrame) that you can print.

In [4]:
import pandas as pd

data1 = {
  'teams': ["BARCA", "REAL", "ATLETICO"],
  'standings': [1, 2, 3]
}

myvar = pd.DataFrame(data1)

print(myvar)


      teams  standings
0     BARCA          1
1      REAL          2
2  ATLETICO          3


### Indexing and manipulation of data through lists
- Pandas can also organize data through something called indexing, which is one of its biggest perks. An index sets the labels for the rows, which appear as the first column of the printed output, as seen in the example below.

In [9]:
# Here is an example using lists and an index.
import pandas as pd 

score = [5/5, 5/5, 1/5]

myvar = pd.Series(score, index = ["math", "science", "pe"])

print(myvar)

math       1.0
science    1.0
pe         0.2
dtype: float64
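
Once a Series has an index, the labels can be used to look values up. The sketch below reuses the same scores and shows lookup by label, lookup by position, and a simple filter.

```python
import pandas as pd

score = [5/5, 5/5, 1/5]
myvar = pd.Series(score, index=["math", "science", "pe"])

print(myvar["pe"])       # look up one value by its label
print(myvar.iloc[0])     # look up one value by its position
print(myvar[myvar < 1])  # keep only the subjects scored below 1.0
```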


# Pandas Classes 
Pandas consists of many classes and functions that let you manipulate data sets stored in lists, dictionaries, and CSV files. Here are some of the classes we are going to cover (hint: take notes on these):
- Series
- Index
- PeriodIndex
- DataFrameGroupBy
- Categorical
- Timestamp


#  PeriodIndex 
- This gives you a way to represent regularly repeating periods of time, as seen below with the months from January 2022 to December 2022. You can use Y for years, M for months, and D for days.


In [14]:
import pandas as pd


time = pd.period_range('2022-01', '2022-12', freq='M')


print(time)

PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12'],
            dtype='period[M]')
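
The same function works with the other frequencies mentioned above. This is a small sketch with arbitrarily chosen date ranges, using Y for years and D for days.

```python
import pandas as pd

# Yearly periods from 2020 through 2023.
years = pd.period_range('2020', '2023', freq='Y')

# Daily periods for the first week of January 2022.
days = pd.period_range('2022-01-01', '2022-01-07', freq='D')

print(years)
print(days)
print(len(years), len(days))
```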


# DataFrameGroupBy
- This lets you split your data into groups and calculate a result for each group with functions such as:
- count(): returns the number of non-null values in each group.
- sum(): returns the sum of values in each group.
- mean(): returns the mean of values in each group.
- min(): returns the minimum value in each group.
- max(): returns the maximum value in each group.
- median(): returns the median of values in each group.
- var(): returns the variance of values in each group.
- agg(): applies one or more functions to each group and returns a new DataFrame with the results.


In [18]:
import pandas as pd

data = {
    'Category': ['E', 'F', 'E', 'F', 'E', 'F', 'E', 'F'],
    'Value': [100, 250, 156, 255, 240, 303, 253, 3014]
}
df = pd.DataFrame(data)


grouped = df.groupby('Category').sum()

print(grouped)


          Value
Category       
E           749
F          3822
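
Several of the functions listed above can be applied in one pass with agg(). The sketch below reuses the same data and asks for the count, mean, minimum, and maximum of each category.

```python
import pandas as pd

data = {
    'Category': ['E', 'F', 'E', 'F', 'E', 'F', 'E', 'F'],
    'Value': [100, 250, 156, 255, 240, 303, 253, 3014]
}
df = pd.DataFrame(data)

# Apply several functions to the 'Value' column of each group at once.
summary = df.groupby('Category')['Value'].agg(['count', 'mean', 'min', 'max'])

print(summary)
```

The result is a new DataFrame with one row per category and one column per function.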


### Categorical 
- This stores values that come from a fixed set of categories, which allows for better organization.

In [23]:
import pandas as pd

colors = pd.Categorical(['yellow', 'orange', 'blue', 'yellow', 'orange'], categories=['yellow', 'orange', 'blue'])

print(colors)


['yellow', 'orange', 'blue', 'yellow', 'orange']
Categories (3, object): ['yellow', 'orange', 'blue']
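
Categories can also be given an order, which lets pandas compare them. The sketch below uses a made-up list of sizes (not from the lesson data) to show the integer codes pandas stores behind the scenes and how an ordered Categorical supports min and max.

```python
import pandas as pd

sizes = pd.Categorical(
    ['small', 'large', 'medium', 'small'],
    categories=['small', 'medium', 'large'],
    ordered=True,  # small < medium < large
)

print(sizes.codes)               # position of each value in the categories list
print(sizes.min(), sizes.max())  # smallest and largest category present
print(pd.Series(sizes).value_counts())  # how often each category appears
```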


### Timestamp Class
- This represents a single point in time, which is useful when working with data sets that deal with time because it lets you store and manipulate exactly when something happened.

In [26]:
import pandas as pd


timing = pd.Timestamp('2023-02-05 02:00:00')

print(timing)


2023-02-05 02:00:00
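
A Timestamp can do more than print itself. As a short sketch, the example below reads individual parts of the same timestamp and shifts it forward with pd.Timedelta (the 36-hour offset is an arbitrary choice).

```python
import pandas as pd

timing = pd.Timestamp('2023-02-05 02:00:00')

print(timing.year, timing.month, timing.day)  # individual parts of the date
print(timing.day_name())                      # name of the weekday

# Move the timestamp forward by 36 hours.
later = timing + pd.Timedelta(hours=36)
print(later)
```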


# CSV FILES!
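- A CSV (comma-separated values) file stores data as rows of text with the values separated by commas. Pandas can read a CSV file into a DataFrame with pd.read_csv, and you can then manipulate the data with the classes discussed above. Here is the data used in the example below: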

- Name, Position, Average, HR, RBI, OPS, JerseyNumber
- Manny Machado, 3B, .298, 32, 102, .897, 13
- Tatis Jr, RF, .281, 42, 97, .975, 23
- Juan Soto, LF, .242, 27, 62, .853, 22
- Xanger Bogaerts, SS, .307, 15, 73, .833, 2
- Nelson Cruz, DH, .234, 10, 64, .651, 32
- Matt Carpenter, DH, .305, 15, 37, 1.138, 14
- Cronezone, 1B, .239, 17, 88, .722, 9
- Ha-Seong Kim, 2B, .251, 11, 59, .708, 7
- Trent Grisham, CF, .184, 17, 53, .626, 1
- Luis Campusano, C, .250, 1, 5, .593, 12
- Austin Nola, C, .251, 4, 40, .649, 26
- Jose Azocar, OF, .257, 0, 10, .630, 28

In [None]:
import pandas as pd

# Read the CSV file and sort by 'JerseyNumber', largest to smallest
df = pd.read_csv('files/padres.csv').sort_values(by=['JerseyNumber'], ascending=False)

print("--JerseyNumber Top 10---------")
print(df.head(10))

print("--JerseyNumber Bottom 10------")
print(df.tail(10))
# Iterating over a DataFrame yields its column names, so this prints the column headers
print(', '.join(df.tail(10)))
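
If you want to try read_csv without creating a file first, one option is to read from a string. The sketch below is an assumption-based variation, not the lesson's own file: it copies three rows from the data above into a string, reads them with io.StringIO, and sorts by home runs.

```python
import io
import pandas as pd

# A few rows copied from the data above, written as CSV text.
csv_text = """Name,Position,Average,HR,RBI,OPS,JerseyNumber
Manny Machado,3B,.298,32,102,.897,13
Tatis Jr,RF,.281,42,97,.975,23
Juan Soto,LF,.242,27,62,.853,22
"""

# io.StringIO lets read_csv treat the string like a file.
df = pd.read_csv(io.StringIO(csv_text))

# Sort by home runs, most first, and show two columns.
top_hr = df.sort_values(by='HR', ascending=False)
print(top_hr[['Name', 'HR']])

# Average batting average across these three players.
print(df['Average'].mean())
```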

# Hacks
- Take notes on content in the notebook file.
- Research another class that is used in pandas.
- Implement that class with pandas, as seen in the examples above.

> Answer the questions below and write code for the last question. Some questions may require individual research.
- What are the key data structures used in Pandas?
    - ?
- How can you merge, join, and concatenate data frames in Pandas?
    - ?
- How can you optimize the performance of Pandas operations on large datasets?
    - ?
- Find a certain scenario, topic, or trend which has quantitative data. Store the data in a file and read it in Pandas. Show some analysis of the data.
