#  Petite Pandas Data Analysis using Pandas and NumPy
> A series of lessons based on the utilization of Pandas and NumPy to read and create interesting things with files and data. 
- toc: true

# Predictive Analysis
Predictive analysis is the use of statistical, data mining, and machine learning techniques to analyze current and historical data in order to make predictions about future events or behaviors. It involves identifying patterns and trends in data, and then using that information to forecast what is likely to happen in the future.

Predictive analysis is used in a wide range of applications, from forecasting sales and demand, to predicting customer behavior, to detecting fraudulent transactions. It involves collecting and analyzing data from a variety of sources, including historical data, customer data, financial data, and social media data, among others.

The process of predictive analysis typically involves the following steps:
1. Defining the problem and identifying the relevant data sources
2. Collecting and cleaning the data
3. Exploring and analyzing the data to identify patterns and trends
4. Selecting an appropriate model or algorithm to use for predictions
5. Training and validating the model using historical data
6. Using the model to make predictions on new data
7. Monitoring and evaluating the performance of the model over time
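
As a rough illustration of steps 4 to 6 above, the sketch below fits a straight line to a few monthly sales figures with `np.polyfit` and extrapolates one month ahead. The numbers and variable names are made up for this example, and the linear trend is only an assumption; a real project would also validate the model on data it has not seen.

```python
import numpy as np

# Hypothetical historical data: month number and units sold (made-up values)
months = np.array([1, 2, 3, 4, 5, 6])
sales = np.array([100, 110, 120, 130, 140, 150])

# Steps 4-5: choose a simple linear model and fit it to the history
slope, intercept = np.polyfit(months, sales, deg=1)

# Step 6: use the fitted line to predict month 7
predicted = slope * 7 + intercept
print(round(predicted, 1))
```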

Predictive analysis can help organizations make more informed decisions, improve efficiency, and gain a competitive advantage by leveraging insights from data.

It is most commonly used in retail, where workers try to predict which products will be most popular and advertise those products as much as possible. It is also common in healthcare, where algorithms analyze patterns to reveal risk factors for diseases and suggest preventive treatment, predict the results of various treatments so the best option can be chosen for each patient, and predict disease outbreaks and epidemics.

# 1. Intro to NumPy and its features

NumPy is the fundamental package for scientific computing in Python. It provides multidimensional array objects and fast mathematical operations on them, which makes data analysis much easier. Its array functions can also be used to load and analyze data files.

If you don't already have numpy installed, you can do so using ```conda install numpy``` or ```pip install numpy```

Once that is complete, to import numpy in your code, all you must do is:

In [1]:
import numpy as np

# 2. Using NumPy to create arrays
An array is the central data structure of the NumPy library. Arrays are containers that can store many items at once. The function ```np.array``` creates an array from a list, and nested lists produce multidimensional arrays.

Shown below is how to create a 1D array:

In [2]:
a = np.array([1, 2, 3])
print(a) 
# this creates a 1D array

[1 2 3]


How could you create a 3D array based on knowing how to make a 1D array?

In [7]:
# create 3D array here
import numpy as np
_3darray = np.zeros((3,4,2))

_3darray[1,2,0]

for i in range(3):
    for j in range(4):
        for k in range(2):
            _3darray[i,j,k] = i*j*k
            
print(_3darray)

[[[0. 0.]
  [0. 0.]
  [0. 0.]
  [0. 0.]]

 [[0. 0.]
  [0. 1.]
  [0. 2.]
  [0. 3.]]

 [[0. 0.]
  [0. 2.]
  [0. 4.]
  [0. 6.]]]


Arrays can be arranged in different ways to make them more readable. As we have seen, arrays are printed in rows and columns, but we can change that layout by using the ```reshape``` method.

In [None]:
c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(c.reshape(1, 9)) # organizes it all in a single line of output
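
To make the effect of ```reshape``` easier to see, here is a small sketch that checks the `shape` attribute before and after. The variable names `row` and `flat` are only for illustration.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print(c.shape)            # three rows, three columns
row = c.reshape(1, 9)     # still 2D, but with a single row
flat = c.reshape(-1)      # -1 lets NumPy work out the length; the result is 1D
print(row.shape, flat.shape)
```

Reshaping never changes the number of elements, only how they are arranged.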

In the code segment below, we select specific rows and columns from the array so that we can analyze just that part of the data.

In [None]:
print(c[1:, :2])
# the 1: means "start at row 1 and select all the remaining rows"
# the :2 means "select the first two columns"
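
The slicing rules above can be tried on their own with the sketch below, which rebuilds the same array so it runs independently. The names `sub`, `first_row`, and `last_col` are illustrative.

```python
import numpy as np

c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

sub = c[1:, :2]      # rows 1 and 2, columns 0 and 1
print(sub)

first_row = c[0]     # a single index selects a whole row
last_col = c[:, -1]  # every row, last column only
print(first_row, last_col)
```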

# 3. Basic array operations

One of the most basic operations that can be performed on arrays is arithmetic operations. With numpy, it is very easy to perform arithmetic operations on arrays. You can add, subtract, multiply and divide arrays, just like you would with regular numbers. When performing these operations, numpy applies the operation element-wise, meaning that it performs the operation on each element in the array separately. This makes it easy to perform operations on large amounts of data quickly and efficiently.

In [3]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # adds the elements at matching positions
print(a - b) # subtracts the elements at matching positions
print(a * b) # multiplies the elements at matching positions
print(a / b) # divides the elements at matching positions

[5 7 9]
[-3 -3 -3]
[ 4 10 18]
[0.25 0.4  0.5 ]


In [4]:
d = np.exp(b)
e = np.sqrt(b)
print(d)
print(e)

[ 54.59815003 148.4131591  403.42879349]
[2.         2.23606798 2.44948974]


Now that you have seen functions beyond the four basic arithmetic operations, such as exponent and square root, can you write code to calculate the three main trig functions (sin, cos, tan), the natural log, and log10 of a 1D array?

In [11]:
array = np.array([1, 2, 3])

# calculate sin
sinx_x = np.sin(array)
print(sinx_x)

# calculate cos
cos_x = np.cos(array)
print(cos_x)

# calculate tan
tanx_x = np.tan(array)
print(tanx_x)

# calculate natural log
ln_x = np.log(array)
print(ln_x)

# calculate log10
log10_x = np.log10(array)
print(log10_x)


[0.84147098 0.90929743 0.14112001]
[ 0.54030231 -0.41614684 -0.9899925 ]
[ 1.55740772 -2.18503986 -0.14254654]
[0.         0.69314718 1.09861229]
[0.         0.30103    0.47712125]


# 4. Data analysis using numpy
Numpy provides a convenient and powerful way to perform data analysis tasks on large datasets. One of the most common tasks in data analysis is finding the mean, median, and standard deviation of a dataset. Numpy provides functions to perform these operations quickly and easily. The mean function calculates the average value of the data, while the median function calculates the middle value in the data. The standard deviation function calculates how spread out the data is from the mean. Additionally, numpy provides functions to find the minimum and maximum values in the data. These functions are very useful for gaining insight into the properties of large datasets and can be used for a wide range of data analysis tasks.

In [12]:
data = np.array([2, 5, 12, 13, 19])
print(np.mean(data)) # finds the mean of the dataset
print(np.median(data)) # finds the median of the dataset
print(np.std(data)) # finds the standard deviation of the dataset
print(np.min(data)) # finds the min of the dataset
print(np.max(data)) # finds the max of the dataset

10.2
12.0
6.04648658313239
2
19


Now that you have learned this, can you find a different way to calculate the sum or product of a dataset than the functions we used before?
- You can use built-in Python functions
- You can use for loops

In [16]:
# create a different way of solving the sum or products of a dataset from what we learned above
#this code finds the sum and product of a dataset using a for loop
data = [1, 2, 3, 4, 5]
sum_data = 0
prod_data = 1
for x in data:
    sum_data += x
    prod_data *= x
print(sum_data, prod_data) 

15 120
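
The for loop above is one option. As a sketch of the other route, the built-in `sum` and `math.prod` (available from Python 3.8) give the same results, and `np.sum` and `np.prod` do the same for arrays.

```python
import math
import numpy as np

data = [1, 2, 3, 4, 5]

# Built-in Python functions
print(sum(data), math.prod(data))

# NumPy equivalents, which also work on multidimensional arrays
arr = np.array(data)
print(np.sum(arr), np.prod(arr))
```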


Numpy also has the ability to handle CSV files, which are commonly used to store and exchange large datasets. By importing CSV files into numpy arrays, we can easily perform complex operations and analysis on the data, making numpy an essential tool for data scientists and researchers.

```genfromtxt``` and ```loadtxt``` are two functions in the numpy library that can be used to read data from text files, including CSV files.

```genfromtxt``` is a more advanced function that can be used to read text files that have more complex structures, including CSV files. ```genfromtxt``` can handle files that have missing or invalid data, or files that have columns of different data types. It can also be used to skip header lines or to read only specific columns from the file. 

In [None]:
import numpy as np

padres = np.genfromtxt('files/padres.csv', delimiter=',', dtype=str, encoding='utf-8')
# delimiter=',' tells numpy that the columns are separated by commas
# genfromtxt is used to read the csv file itself
# dtype=str reads every value as a string; dtype=None would have numpy detect each column's type

print(padres)
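
To show how ```genfromtxt``` copes with missing data, skipped headers, and selected columns, here is a minimal sketch. It reads from an in-memory string instead of a file so it runs on its own; the CSV text and the name `stats` are made up for the example.

```python
import io
import numpy as np

# Made-up CSV text standing in for a file; the second row has a missing HR value
csv_text = "Name,HR,RBI\nPlayer A,32,102\nPlayer B,,97\n"

stats = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,     # skip the header row
    usecols=(1, 2),    # read only the two numeric columns
)
print(stats)           # the missing value is read as nan
```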

```loadtxt``` is a simpler function that can be used to read simple text files that have a regular structure, such as files that have only one type of data (such as all integers or all floats). ```loadtxt``` can be faster than ```genfromtxt``` because it assumes that the data in the file is well-structured and can be easily parsed.

In [None]:
import numpy as np

padres = np.loadtxt('files/padres.csv', delimiter=',', dtype=str, encoding='utf-8')
print(padres)

In [18]:
for i in padres:
    print(",".join(i))

Name, Position, Average, HR, RBI, OPS, JerseyNumber
Manny Machado, 3B, .298, 32, 102, .897, 13
Fernando Tatis Jr, RF, .281, 42, 97, .975, 23
Juan Soto, LF, .242, 27, 62, .853, 22
Xander Bogaerts, SS, .307, 15, 73, .833, 2
Nelson Cruz, DH, .234, 10, 64, .651, 32
Matt Carpenter, DH, .305, 15, 37, 1.138, 14
Jake Cronenworth, 1B, .239, 17, 88, .722, 9
Ha-Seong Kim, 2B, .251, 11, 59, .708, 7
Trent Grisham, CF, .184, 17, 53, .626, 1
Luis Campusano, C, .250, 1, 5, .593, 12
Austin Nola, C, .251, 4, 40, .649, 26
Jose Azocar, OF, .257, 0, 10, .630, 28


# Pandas
### What is Pandas
Pandas is a Python library used for working with data sets. A Python library is a collection of prewritten code that you can import and reuse. Pandas has functions for analyzing, cleaning, exploring, and manipulating data.

### Why Use Pandas?
Pandas allows us to analyze big data and draw conclusions based on statistical theories. Pandas can clean messy data sets and make them readable and relevant. It is a core tool for both data analysis and data manipulation.

### What Can Pandas Do?
Pandas gives you answers about the data, such as:
- Is there a correlation between two or more columns?
- What is the average value?
- What is the max value?
- What is the min value?

It also lets you:
- Load data
- Delete data
- Sort data

Pandas is also able to delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
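
A minimal sketch of these ideas, assuming a tiny made-up table with one missing value: it answers the average, max, min, and correlation questions and then cleans the data with `dropna`. The column names are hypothetical.

```python
import pandas as pd

# Made-up data with one missing score
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0],
    "score": [50.0, 60.0, None, 80.0],
})

print(df["score"].mean())                     # average, ignoring the missing value
print(df["score"].max(), df["score"].min())   # max and min values
print(df["hours"].corr(df["score"]))          # correlation between two columns

cleaned = df.dropna()                         # drop rows that contain missing values
print(len(cleaned))
```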

# Basics of Pandas.

In [None]:
import pandas as pd
# This imports the pandas library under the short name pd; it is needed in any code that uses pandas.

#### DICTIONARIES AND DATASETS
- One way to build a pandas data set is to create a dictionary, such as data1 below, and pass it to ```pd.DataFrame```, which turns it into a table that can be printed.

In [16]:
import pandas as pd

data1 = {
  'teams': ["BARCA", "REAL", "ATLETICO"],
  'standings': [1, 2, 3]
}

myvar = pd.DataFrame(data1)

print(myvar)


      teams  standings
0     BARCA          1
1      REAL          2
2  ATLETICO          3


### Indexing and manipulation of data through lists.
- One of the biggest perks of pandas is that it lets you organize the data. We call this indexing: defining the labels that appear in the first column of a Series or DataFrame.

In [None]:
# Here is an example using lists and an index.
import pandas as pd 

score = [5/5, 5/5, 1/5]

myvar = pd.Series(score, index = ["math", "science", "pe"])

print(myvar)
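
Once a Series has an index, values can be looked up by label instead of by position. The sketch below repeats the same Series so it runs on its own.

```python
import pandas as pd

score = [5/5, 5/5, 1/5]
myvar = pd.Series(score, index=["math", "science", "pe"])

print(myvar["pe"])            # look up one value by its label
print(myvar.index.tolist())   # the labels that replaced 0, 1, 2
```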

# Pandas Classes 
The pandas library consists of many classes and functions that allow you to manipulate datasets stored in lists, dictionaries, and CSV files. Here are some of the ones we are going to cover (hint: take notes on these):
- Series
    - 1d labeled array.
    - can hold any data type including integers, strings, floats, etc
    - data is aligned with an index
- Index
    - immutable array-like object that provides label-based lookup for Series or DataFrame objects
    - can be created from a list or array of labels
- PeriodIndex
    - represents a sequence of periods (e.g., months or quarters) and is used to index pandas data structures.
    - It can be created using the period_range function
- DataFrameGroupBy
    - is an object returned by a groupby operation in pandas
    - It represents a collection of DataFrame objects, each corresponding to a group defined by one or more keys.
    - It is a powerful tool for data analysis and aggregation, allowing for efficient processing of large datasets
- Categorical
    - It represents a finite set of possible values (categories) that data can take on.
    - It is useful for representing data that has a limited number of possible values and can improve performance and memory usage.
- Timestamp
    - data type for representing a point in time.
    - It is the fundamental data type for creating a DatetimeIndex, which is used for indexing time-series data in pandas.

#  PeriodIndex 
- This represents a regular sequence of time periods, as seen below from January 2022 to December 2022. You can use Y for years, M for months, and D for days.


In [13]:
import pandas as pd


time = pd.period_range('2022-01', '2022-12', freq='M')


print(time)

PeriodIndex(['2022-01', '2022-02', '2022-03', '2022-04', '2022-05', '2022-06',
             '2022-07', '2022-08', '2022-09', '2022-10', '2022-11', '2022-12'],
            dtype='period[M]')


Now implement a way to show a period index from June 2022 to July 2023 in days.

In [14]:
# use period_range to show, in days, June 2022 to July 2023
import pandas as pd


time = pd.period_range('2022-06', '2023-07', freq='D')


print(time)

PeriodIndex(['2022-06-01', '2022-06-02', '2022-06-03', '2022-06-04',
             '2022-06-05', '2022-06-06', '2022-06-07', '2022-06-08',
             '2022-06-09', '2022-06-10',
             ...
             '2023-06-22', '2023-06-23', '2023-06-24', '2023-06-25',
             '2023-06-26', '2023-06-27', '2023-06-28', '2023-06-29',
             '2023-06-30', '2023-07-01'],
            dtype='period[D]', length=396)


# Dataframe Grouped By 
- This allows you to organize your data into groups and calculate summary functions for each group, such as:
- count(): returns the number of non-null values in each group.
- sum(): returns the sum of values in each group.
- mean(): returns the mean of values in each group.
- min(): returns the minimum value in each group.
- max(): returns the maximum value in each group.
- median(): returns the median of values in each group.
- var(): returns the variance of values in each group.
- agg(): applies one or more functions to each group and returns a new DataFrame with the results.


In [None]:
import pandas as pd

data = {
    'Category': ['E', 'F', 'E', 'F', 'E', 'F', 'E', 'F'],
    'Value': [100, 250, 156, 255, 240, 303, 253, 3014]
}
df = pd.DataFrame(data)


grouped = df.groupby('Category') #GUESS WHAT THIS WOULD BE IF WE WERE LOOKING FOR COMBINED TOTALS!()

print(grouped)
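
Printing the grouped object by itself only shows a `DataFrameGroupBy` placeholder, because nothing has been calculated yet. As a sketch of the next step, applying `sum()` or `agg()` to the same data produces the per-category results. The names `totals` and `summary` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["E", "F", "E", "F", "E", "F", "E", "F"],
    "Value": [100, 250, 156, 255, 240, 303, 253, 3014],
})

# Combined total of Value for each category
totals = df.groupby("Category")["Value"].sum()
print(totals)

# Several summary functions at once
summary = df.groupby("Category")["Value"].agg(["count", "mean", "max"])
print(summary)
```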


### Categorical 
- This stores values that come from a fixed set of categories, which keeps the data better organized.

In [None]:
import pandas as pd

colors = pd.Categorical(['yellow', 'orange', 'blue', 'yellow', 'orange'], categories=['yellow', 'orange', 'blue'])

print(colors)
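
Behind the scenes, a Categorical keeps the list of allowed categories once and stores each item as a small integer code. The sketch below inspects both for the same colors.

```python
import pandas as pd

colors = pd.Categorical(
    ["yellow", "orange", "blue", "yellow", "orange"],
    categories=["yellow", "orange", "blue"],
)

print(colors.categories.tolist())  # the allowed values
print(colors.codes.tolist())       # each item as the position of its category
```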

### Timestamp Class
- This represents a single point in time, which is useful when working with datasets that deal with time because you can read and manipulate dates and times directly.

In [15]:
import pandas as pd

timing = pd.Timestamp('2023-02-05 02:00:00')

print(timing)

2023-02-05 02:00:00
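
A Timestamp can be taken apart or shifted. The sketch below reads its components and adds three hours with `pd.Timedelta`; the name `later` is only for illustration.

```python
import pandas as pd

timing = pd.Timestamp("2023-02-05 02:00:00")

print(timing.year, timing.month, timing.day)  # individual components
later = timing + pd.Timedelta(hours=3)        # shift the time forward
print(later)
```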


# CSV FILES!
- A CSV file contains tabular data. Pandas can read it with a single function call, and you can then manipulate the data with the classes discussed above.

- Name, Position, Average, HR, RBI, OPS, JerseyNumber
- Manny Machado, 3B, .298, 32, 102, .897, 13
- Fernando Tatis Jr, RF, .281, 42, 97, .975, 23
- Juan Soto, LF, .242, 27, 62, .853, 22
- Xander Bogaerts, SS, .307, 15, 73, .833, 2
- Nelson Cruz, DH, .234, 10, 64, .651, 32
- Matt Carpenter, DH, .305, 15, 37, 1.138, 14
- Jake Cronenworth, 1B, .239, 17, 88, .722, 9
- Ha-Seong Kim, 2B, .251, 11, 59, .708, 7
- Trent Grisham, CF, .184, 17, 53, .626, 1
- Luis Campusano, C, .250, 1, 5, .593, 12
- Austin Nola, C, .251, 4, 40, .649, 26
- Jose Azocar, OF, .257, 0, 10, .630, 28

QUESTION: WHAT DO YOU GUYS THINK THE INDEX FOR THIS WOULD BE?
- names or the index of the keys

Can you explain what is going on in the code segment below? (hint: define what ascending=False and df.head mean)

In [None]:
import pandas as pd

#read csv and sort by 'Name' in reverse alphabetical order (Z to A)
df = pd.read_csv('files/padres.csv').sort_values(by=['Name'], ascending=False)

print("--Name Top 10---------")
print(df.head(10))

print("--Name Bottom 10------")
print(df.tail(10))
#joining a DataFrame iterates over its column names, so this prints the header
print(', '.join(df.tail(10)))

- ascending=False means the rows are sorted in descending order; here that is by Name, from Z to A
- df.head(10) returns the first 10 rows
- df.tail(10) returns the last 10 rows
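
To try sorting without the CSV file, here is a small self-contained sketch. The rows are made up and stand in for `files/padres.csv`; sorting by a numeric column makes the effect of `ascending=False` easy to see.

```python
import pandas as pd

# Made-up rows, used instead of files/padres.csv so the example runs on its own
df = pd.DataFrame({
    "Name": ["Avery", "Blake", "Casey", "Drew"],
    "HR": [10, 32, 4, 17],
})

by_hr = df.sort_values(by=["HR"], ascending=False)  # largest HR first
print(by_hr.head(2))   # first two rows after sorting
print(by_hr.tail(2))   # last two rows after sorting
```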

In [None]:
import pandas as pd


df = pd.read_csv("./files/housing.csv")


mode_total_rooms = df['total_rooms'].mode()


print(f"The mode of the 'total_rooms' column is: {mode_total_rooms}")

In [None]:
import pandas as pd

df = pd.read_csv("./files/housing.csv")


grouped_df = df.groupby('total_rooms')


agg_df = grouped_df.agg({'total_rooms': 'sum', 'population': 'mean', 'longitude': 'count'})



print(agg_df)

### WHAT DO YOU GUYS THINK df.agg means in the context of pandas, and what does it stand for?
- agg() is used to apply aggregation functions on a grouped DataFrame. In this case, the DataFrame df is first grouped by the 'total_rooms' column using the groupby() method. Then, aggregation functions are applied to each group using the agg() method.
- agg() stands for "aggregate" and it applies one or more functions to each group and returns a new DataFrame with the results.


# Our Frontend Data Analysis Project
[Link](https://paravsalaniwal.github.io/T3Project/DataAnalysisProject/)

# Popcorn Hacks
- Complete fill in the blanks for Predictive Analysis Numpy `DONE`
- Take notes on Pandas where it asks you to `DONE`
- Complete code segment tasks in Pandas and NumPy  `DONE`

# Main Hack
- Make a data file - content is up to you, just make sure there are integer values - and print
- Run Pandas and NumPy commands
    - Pandas:
        - Find Min and Max values `DONE`
        - Sort in order - can be order of least to greatest or vice versa `DONE`
        - Create a smaller data frame and merge it with your data file

In [9]:
import pandas as pd

#read csv and sort cars by 'price' from largest to smallest
data = pd.read_csv('files/cars.csv').sort_values(by=['price'], ascending=False)

columns = data[['brand','color','price']]

#print max and min price 
print("Max price")
print(columns[columns.price == columns.price.max()])
print()
print("Min price")
print(columns[columns.price == columns.price.min()])
print()

#sorting by price
print("--Price Top 10---------")
print(columns.head(10))
print()
print("--Price Bottom 10------")
print(columns.tail(10))

Max price
             brand   color  price
466  mercedes-benz  silver  84900

Min price
         brand   color  price
420  chevrolet  silver      0

--Price Top 10---------
               brand   color  price
466    mercedes-benz  silver  84900
274            dodge    blue  67000
371              bmw   black  61200
384             ford    blue  58500
393            lexus  silver  55600
44              ford   black  55000
353  harley-davidson   black  54680
49              ford   black  54000
365             ford   white  53500
95               bmw    blue  53500

--Price Bottom 10------
         brand      color  price
312  chevrolet      white     25
435       ford       gray     25
366       ford       gray     25
359        gmc      white     25
356  chevrolet       gray     25
431  chevrolet      black     25
281      dodge  dark blue     25
495  chevrolet      white     25
337       ford      white     25
420  chevrolet     silver      0


In [18]:
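#merging data frames

import pandas as pd

df = pd.read_csv('files/cars.csv')

#smaller data frame: only the bmw and lexus rows, with three columns
smaller_df = df[df["brand"].isin(["bmw", "lexus"])][["brand", "color", "price"]]

#rows that share a brand are paired up; the color and price columns from the two frames get _x and _y suffixes
merged_df = pd.merge(df, smaller_df, on="brand")
print(merged_df.head())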
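
A merge is often easier to follow when the smaller frame adds new information. The sketch below assumes a made-up stand-in for the car data and a hypothetical `origins` lookup table, then joins them on the shared `brand` column.

```python
import pandas as pd

# Made-up stand-in for the car data so the example runs without files/cars.csv
cars = pd.DataFrame({
    "brand": ["bmw", "ford", "lexus", "ford"],
    "price": [61200, 58500, 55600, 55000],
})

# A smaller, hypothetical lookup table keyed on the same "brand" column
origins = pd.DataFrame({
    "brand": ["bmw", "ford"],
    "country": ["Germany", "USA"],
})

# how="left" keeps every car; brands missing from the lookup get NaN
merged = pd.merge(cars, origins, on="brand", how="left")
print(merged)
```

With `how="left"`, the result has one row per car, so nothing from the main table is lost.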

- Numpy:
     - Random number generation
     - create a multi-dimensional array (multiple elements)
     - create an array with linearly spaced intervals between values

In [11]:
#random number generation

import numpy as np

#generate a random float between 0 and 1
rand_num = np.random.rand()
print(rand_num)

#generate an array of random integers between 0 and 9 of shape (3, 3)
rand_int_arr = np.random.randint(10, size=(3, 3))
print(rand_int_arr)


0.6292344914864445
[[1 5 2]
 [5 6 8]
 [6 0 9]]
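
The values above change on every run. If repeatable results are needed, one option is NumPy's `default_rng` with a fixed seed, sketched below; the seed value 42 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary

rand_float = rng.random()                      # float in [0, 1)
rand_ints = rng.integers(0, 10, size=(3, 3))   # integers from 0 to 9
print(rand_float)
print(rand_ints)
```

Running the cell again with the same seed gives the same numbers.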


In [13]:
#multi dimensional arrays

#2D array of shape (3, 4) with all elements initialized to 0
zeros_arr = np.zeros((3, 4))
print(zeros_arr)
print()

#3D array of shape (2, 3, 4) with all elements initialized to 1
ones_arr = np.ones((2, 3, 4))
print(ones_arr)
print()

#4D array of shape (2, 3, 4, 5) with random values from a uniform distribution between 0 and 1
rand_arr = np.random.rand(2, 3, 4, 5)
print(rand_arr)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]

[[[[0.69454421 0.25366735 0.62689366 0.05798716 0.74825873]
   [0.69778232 0.61132514 0.54655762 0.40403862 0.65147631]
   [0.97553812 0.77913322 0.31808833 0.57513014 0.35863899]
   [0.20359516 0.76592339 0.43124031 0.15114225 0.3493766 ]]

  [[0.67244372 0.03946627 0.07425983 0.06087298 0.30959492]
   [0.35133701 0.24914212 0.25632278 0.62202084 0.44463593]
   [0.76932932 0.76368062 0.67173279 0.57528139 0.19141534]
   [0.39764306 0.93538301 0.04580841 0.43151847 0.88632047]]

  [[0.78310536 0.12142801 0.43033277 0.97682331 0.37624081]
   [0.32095465 0.89554322 0.80179843 0.31590317 0.98404534]
   [0.45835424 0.04334572 0.53331907 0.10332215 0.10766188]
   [0.37454712 0.44691308 0.50645191 0.06003819 0.89792116]]]


 [[[0.23559808 0.11186871 0.74537817 0.77480432 0.3943096 ]
   [0.17925438 0.42741998 0.76370867 0.87989194 0.36730361]
   [

In [15]:
#linear array

#5 linearly spaced values between 0 and 1
lin_arr = np.linspace(0, 1, 5)
print(lin_arr)
print()

#10 linearly spaced values between 1 and 3
lin_arr_2 = np.linspace(1, 3, 10)
print(lin_arr_2)

[0.   0.25 0.5  0.75 1.  ]

[1.         1.22222222 1.44444444 1.66666667 1.88888889 2.11111111
 2.33333333 2.55555556 2.77777778 3.        ]


# Grading
The grading will be binary - all or nothing; no partial credit
- 0.3 for all the popcorn hacks
- 0.6 for the main hack - CSV file
- 0.1 for going above and beyond in the main hack