# Previous Week

 - University Management System - Scratch
 - Random Password Generator
 - Text Cipher-Decipher

# This Week

## Statistics in data science
 - Topics:
     - Population: the complete set of items or individuals that data is collected about.
     - Sample: a subset of the population that is actually observed.
     - Variable: any characteristic that can be measured or counted.
     - Quantitative analysis (statistical): collecting and interpreting numerical data to find patterns, often with the help of data visualization.
     - Qualitative analysis (non-statistical): drawing general insights from non-numerical sources such as text and other media.
     - Descriptive statistics: summarizing the characteristics of the data that was collected.
     - Inferential statistics: using a sample to draw conclusions or make predictions about the population.
     - Central tendency (measures of the center): mean (the average of all values), median (the middle value of a sorted data set), and mode (the most frequent value in a data set).
     - Measures of the spread:
         - Range: the difference between the largest and smallest values in a data set.
         - Variance: the average squared distance of the values from their mean.
         - Standard deviation: the square root of the variance, which expresses the spread in the same units as the data.
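
The measures listed above can be checked on a small data set. The following is a minimal sketch using Python's built-in `statistics` module; the list `values` is made up for illustration and does not come from the course material. It uses the population formulas (`pvariance`, `pstdev`), which divide by the number of values.

```python
import statistics

# Hypothetical data set, chosen so the results are round numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of the center
mean = statistics.mean(values)        # sum of the values divided by their count
median = statistics.median(values)    # middle of the sorted values (average of the two middle ones here)
mode = statistics.mode(values)        # most frequent value

# Measures of the spread
value_range = max(values) - min(values)    # largest value minus smallest value
variance = statistics.pvariance(values)    # average squared distance from the mean
std_dev = statistics.pstdev(values)        # square root of the variance

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "variance:", variance, "std dev:", std_dev)
```

For this list the mean is 5, the median 4.5, the mode 4, the range 7, the variance 4 and the standard deviation 2.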

# Getting Started with Pandas

- The pandas package is the most important tool at the disposal of data scientists and analysts working in Python today. Machine learning and visualization tools may get most of the attention, but pandas is the backbone of most data projects.

## What's Pandas for?
 - Pandas has so many uses that it might make sense to list the things it can't do instead of what it can do.
 - This tool is essentially your data’s home. Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it.

# Pandas First Steps

## Install and import
 - Pandas is an easy package to install. Open up your terminal program (for Mac users) or command line (for PC users) and install it using either of the following commands:

In [None]:
conda install pandas
pip install pandas

## Alternatively, if you're currently viewing this article in a Jupyter notebook you can run this cell

In [None]:
!pip install pandas

## To import pandas we usually import it with a shorter name since it's used so much:

In [2]:
import pandas as pd  # "pd" is the conventional short alias
pd.__version__

'2.0.3'

# Core components of pandas: Series and DataFrames

- The two primary components of pandas are the Series and the DataFrame.

- A Series is essentially a column, and a DataFrame is a two-dimensional table made up of a collection of Series.
     - In other words, a Series is a one-dimensional labeled array.

![image.png](attachment:image.png)

- DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean.
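
As a small illustration of that similarity, the sketch below calls `fillna` and `mean` on both a Series and a DataFrame. The data, including the missing value, is made up for this example.

```python
import pandas as pd

# Hypothetical data with one missing value in the apples column.
apples = pd.Series([3, 2, None, 1], name="apples")
fruit = pd.DataFrame({"apples": [3, 2, None, 1], "oranges": [0, 3, 7, 2]})

# The same method names work on both objects.
series_filled = apples.fillna(0)
frame_filled = fruit.fillna(0)

series_mean = series_filled.mean()   # a single number
frame_means = frame_filled.mean()    # one mean per column, returned as a Series

print(series_mean)
print(frame_means)
```

The main difference is the shape of the result: a Series method usually returns one value, while the same DataFrame method returns one value per column.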

# Creating DataFrames from scratch

- Creating DataFrames right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs.

- There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.

In [5]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

In [14]:
import pandas as pd
import numpy as np

purchases = pd.DataFrame(data)
purchases

| | apples | oranges |
|---|---|---|
| 0 | 3 | 0 |
| 1 | 2 | 3 |
| 2 | 0 | 7 |
| 3 | 1 | 2 |


In [8]:
print(type(purchases))
print(type(purchases['apples']))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [35]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases.to_csv('purchases.csv')

In [11]:
purchases_2 = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David', 'Rishi'])
purchases_2

ValueError: Length of values (4) does not match length of index (5)
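
The error occurs because each column holds four values while the index lists five labels. One possible way to add a row for a fifth customer, shown here only as a sketch, is to build the frame with matching lengths and then call `reindex`, which fills the new row with missing values.

```python
import pandas as pd

data = {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]}
names = ["June", "Robert", "Lily", "David"]

# Four values per column and four labels: the lengths match, so this works.
purchases = pd.DataFrame(data, index=names)

# Five labels for four values raises ValueError.
try:
    pd.DataFrame(data, index=names + ["Rishi"])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# reindex adds the new label and fills its row with NaN (missing values).
purchases_2 = purchases.reindex(names + ["Rishi"])
print(purchases_2)
```

Because NaN is a floating-point value, the columns of `purchases_2` are converted from integers to floats.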

In [37]:
purchases

| | apples | oranges |
|---|---|---|
| June | 3 | 0 |
| Robert | 2 | 3 |
| Lily | 0 | 7 |
| David | 1 | 2 |


In [20]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2],
    'pineapple': [0,1,pd.NaT,pd.NA]
}

df = pd.DataFrame(data)
print(df)

   apples  oranges pineapple
0       3        0         0
1       2        3         1
2       0        7       NaT
3       1        2      <NA>


## So now we could locate a customer's order by using their name:

In [8]:
purchases.loc['June']

apples     3
oranges    0
Name: June, dtype: int64
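
A few related lookups, sketched on the same purchases data: `.loc` selects by label, while `.iloc` selects by integer position. The variable names are only illustrative.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

by_label = purchases.loc["June"]                  # row selected by index label
by_position = purchases.iloc[0]                   # same row, selected by position
single_value = purchases.loc["Lily", "oranges"]   # one cell: row label, column label
two_rows = purchases.loc[["June", "David"]]       # a list of labels returns a DataFrame

print(by_label)
print(single_value)
print(two_rows)
```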

# How to read in data

In [39]:
df = pd.read_csv("purchases.csv", index_col=0)  # use the first column as the index
df.tail(2)

| | apples | oranges |
|---|---|---|
| Lily | 0 | 7 |
| David | 1 | 2 |
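
To see what `index_col=0` does, the sketch below writes the purchases data to CSV text and reads it back twice. It uses an in-memory `io.StringIO` buffer instead of `purchases.csv` so that it runs without touching the disk; that substitution is a choice made for this example.

```python
import io
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

# With no path argument, to_csv returns the CSV content as a string.
csv_text = purchases.to_csv()

# Without index_col, the saved names come back as an ordinary column.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is restored as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(restored.index.tolist())
```

Since the index had no name when it was saved, pandas labels that column `Unnamed: 0` when it is read back as regular data.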


# Understanding your data using Pandas

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, June to David
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   apples   4 non-null      int64
 1   oranges  4 non-null      int64
dtypes: int64(2)
memory usage: 96.0+ bytes


In [11]:
df.shape  # (rows, columns)

(4, 2)

In [12]:
df.describe()

| | apples | oranges |
|---|---|---|
| count | 4.0 | 4.0 |
| mean | 1.5 | 3.0 |
| std | 1.290994 | 2.94392 |
| min | 0.0 | 0.0 |
| 25% | 0.75 | 1.5 |
| 50% | 1.5 | 2.5 |
| 75% | 2.25 | 4.0 |
| max | 3.0 | 7.0 |


# Count

In [13]:
df['apples'].count()
df.apples.count()

4

In [14]:
df['oranges'].count()

4

# Mean

In [15]:
df['apples'].mean()

1.5

In [16]:
df['oranges'].mean()

3.0

# Standard Deviation

In [17]:
df['apples'].std()

1.2909944487358056

In [18]:
df['oranges'].std()

2.943920288775949
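
By default, pandas `std()` computes the sample standard deviation, which divides by n - 1 (`ddof=1`), and passing `ddof=0` gives the population version. The sketch below reproduces both numbers for the apples column by hand, using only the standard library.

```python
import math
import statistics

apples = [3, 2, 0, 1]

mean = sum(apples) / len(apples)                    # 1.5
squared_devs = [(x - mean) ** 2 for x in apples]    # 2.25, 0.25, 2.25, 0.25

# Sample standard deviation: divide by n - 1 (what pandas reports by default).
sample_std = math.sqrt(sum(squared_devs) / (len(apples) - 1))

# Population standard deviation: divide by n.
population_std = math.sqrt(sum(squared_devs) / len(apples))

# The standard library gives the same two numbers.
assert math.isclose(sample_std, statistics.stdev(apples))
assert math.isclose(population_std, statistics.pstdev(apples))

print(sample_std, population_std)
```

The sample value, about 1.291, matches the `df['apples'].std()` output above. The population value is about 1.118.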

# Min

In [21]:
df['apples'].min()

0

In [22]:
df['oranges'].min()

0

![image.png](attachment:cc60ba15-9746-4581-8bf7-d4f058e98a8d.png)

Example:

 - Find the median, lower quartile and upper quartile of the following numbers:
   12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25

Solution:

First, arrange the data in ascending order: 5, 7, 12, 14, 15, 22, 25, 30, 36, 42, 53

![image.gif](attachment:f63c7ad4-2267-4e41-8788-384f715cbbf6.gif)

Median (middle value) = 22

Lower quartile (middle value of the lower half) = 12

Upper quartile (middle value of the upper half) = 36

If there is an even number of data items, the median is the average of the two middle values.
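
The hand method above can be sketched in plain Python. The helper name `quartiles_by_halves` is made up for this example. It sorts the data, leaves the middle value out of both halves when the count is odd, and takes the median of each half.

```python
import statistics

def quartiles_by_halves(values):
    """Return (lower quartile, median, upper quartile) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For an odd count, the middle value belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
    )

numbers = [12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25]
q1, q2, q3 = quartiles_by_halves(numbers)
print(q1, q2, q3)  # 12 22 36
```

Note that pandas `quantile()` interpolates linearly by default, so for these same eleven numbers it returns 13.0 and 33.0 for the lower and upper quartiles rather than 12 and 36. Both conventions are in common use.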

# 1st Quartile

In [23]:
df['apples'].quantile(.25)

0.75

In [24]:
df['oranges'].quantile(.25)

1.5

# 2nd Quartile (Median)

In [25]:
df['apples'].quantile(.5)

1.5

In [26]:
df['oranges'].quantile(.5)

2.5

In [27]:
df['apples'].median()

1.5

In [28]:
df['oranges'].median()

2.5

# 3rd Quartile

In [29]:
df['apples'].quantile(.75)

2.25

In [30]:
df['oranges'].quantile(.75)

4.0
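
The quartile values in this section are not always actual data points because `quantile()` interpolates linearly between neighbouring sorted values by default. The sketch below reimplements that rule in plain Python. The helper name `linear_quantile` is made up for this example and is not part of pandas.

```python
import math

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

apples = [3, 2, 0, 1]
oranges = [0, 3, 7, 2]

for q in (0.25, 0.5, 0.75):
    print(q, linear_quantile(apples, q), linear_quantile(oranges, q))
```

For example, the first quartile of apples sits at position 0.25 × 3 = 0.75 in the sorted list [0, 1, 2, 3], three quarters of the way from 0 to 1, which gives 0.75.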

# Max

In [31]:
df['apples'].max()

3

In [32]:
df['oranges'].max()

7

# Upcoming Week

- Dealing with Null Values
- Dataset Preprocessing
- Data Analysis - Categorical
- Multi Column Analysis
- Data Analysis with Sorting
- Analysis with Custom DataFrames