# __Pandas DataFrame__

## __Agenda__

In this lesson, we will cover the following concepts with the help of examples:

- Introduction to Pandas DataFrame
  * Creating a DataFrame from Various Methods
  * Accessing the DataFrame
  * Understanding DataFrame Basics
- Introduction to Statistical Operations in Pandas
  * Descriptive Statistics
  * Mean, Median, and Standard Deviation
  * Correlation Analysis

## __1. Introduction to Pandas DataFrame__

A Pandas DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns).

It is a primary data structure in the Pandas library, providing a versatile and efficient way to handle and manipulate data in Python.

![image.png](attachment:17113d99-3119-4b69-a615-e4d67afc3b60.png)

### __Key Features:__
- __Tabular structure:__ The DataFrame is organized as a table with rows and columns, similar to a spreadsheet or SQL table.

- __Labeled axes:__ Both rows and columns are labeled, allowing for easy indexing and referencing of data.

- __Heterogeneous data types:__ Each column in a DataFrame can contain different types of data, such as integers, floats, strings, or even complex objects.

- __Versatility:__ DataFrames can be built from, and written back to, a wide range of data formats, including CSV, Excel, SQL databases, and more.

- __Data alignment:__ Operations on DataFrames are designed to handle missing values gracefully, aligning data based on labels.
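
Two of the features above, heterogeneous data types and label-based alignment, can be illustrated with a minimal sketch. The column names and values here are hypothetical and chosen only for demonstration.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type
people = pd.DataFrame({'name': ['Alice', 'Bob'],
                       'age': [25, 30],
                       'score': [88.5, 92.0]})
column_types = people.dtypes  # one dtype per column

# Two Series that share only the labels 'b' and 'c'
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels; labels missing from either side become NaN
total = s1 + s2
print(column_types)
print(total)
```

Labels `a` and `d` appear in only one Series, so their sums are reported as missing values rather than raising an error.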

### __1.1 Creating a DataFrame from Various Methods__
Creating a DataFrame is usually the first step in any data analysis workflow.
- Pandas provides several constructors and reader functions to cover different data sources and structures.
- Python dictionaries, lists, and NumPy arrays can be passed to pd.DataFrame(), while external files such as CSV and Excel are loaded with pd.read_csv() and pd.read_excel().

In [None]:
import pandas as pd

# Creating a DataFrame from a dictionary
data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [25, 30, 22],
             'Salary': [50000, 60000, 45000]}

df_dict = pd.DataFrame(data_dict,index=['a','b','c'])
df_dict

In [None]:
df_dict.index

In [None]:
df_dict.columns

In [None]:
df_dict.values

In [None]:
df_dict.ndim

In [None]:
df_dict.shape

In [None]:
df_dict

In [None]:
# loc and iloc
df_dict.iloc[1,2]

In [None]:
df_dict.loc['b','Name']

In [None]:
# slicing
df_dict.iloc[0:2]

In [None]:
# slicing
df_dict.loc['a':'c']
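
The two slicing cells above behave differently at the end point: iloc slices by position and excludes the stop, while loc slices by label and includes it. A small sketch, using a hypothetical df_demo that mirrors df_dict:

```python
import pandas as pd

# Hypothetical copy of the labeled DataFrame used above
df_demo = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                        'Age': [25, 30, 22]},
                       index=['a', 'b', 'c'])

# Position-based: stop position 2 is excluded, so rows 'a' and 'b'
by_position = df_demo.iloc[0:2]

# Label-based: stop label 'b' is included, so also rows 'a' and 'b'
by_label = df_demo.loc['a':'b']

print(by_position)
print(by_label)
```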

In [None]:
import numpy as np
x = np.random.randint(0,100,size=(4,5))

In [None]:
df = pd.DataFrame(data=x, index=list('abcd'), columns=list('efghi'))
df

In [None]:
import pandas as pd

# Creating a DataFrame from a dictionary
data_dict = {'Name': ['Alice', 'Bob', 'Charlie'],
             'Age': [25, 30, 22],
             'Salary': [50000, 60000, 45000]}

df_dict = pd.DataFrame(data_dict)
print(df_dict)

# Creating a DataFrame from lists
data_list = [['Alice', 25, 50000], ['Bob', 30, 60000], ['Charlie', 22, 45000]]

# Defining column names
columns = ['Name', 'Age', 'Salary']

df_list = pd.DataFrame(data_list, columns=columns)
print(df_list)

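# Creating a DataFrame from a NumPy array
# Note: a NumPy array has a single dtype, so mixing text and numbers
# stores every value (including Age and Salary) as a string
import numpy as np
data_array = np.array([['Alice', 25, 50000],
                       ['Bob', 30, 60000],
                       ['Charlie', 22, 45000]])

df_array = pd.DataFrame(data_array, columns=columns)
print(df_array)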

# # Creating a DataFrame from a CSV file
# df_csv = pd.read_csv('HousePrices.csv')
# print(df_csv)

# # Creating a DataFrame from an Excel file
# df_excel = pd.read_excel('Iris.xlsx')
# print(df_excel)

In [None]:
df2 = pd.read_csv("HousePrices.csv")
df2

In [None]:
df2.dtypes

In [None]:
df3 = pd.read_excel("Iris.xlsx")
df3

In [None]:
# The sepal columns belong to the Iris data in df3 (df2 holds the house prices)
df3['sepal_length']

In [None]:
df3[['sepal_length','sepal_width']]

In [None]:
df3[df3['sepal_length'] < 5]

In [None]:
import sys
sys.getsizeof(df2)

### __1.2 Accessing the DataFrame__

Accessing a DataFrame means selecting specific columns, rows, or individual cells.
- Square brackets, the loc and iloc indexers, and Boolean conditions each retrieve a different part of the data.
- loc selects by label and iloc selects by integer position, so the same DataFrame can be indexed either way.

In [None]:
import pandas as pd

# Creating a sample DataFrame
data = {'Column_name': [5, 15, 8],
        'Column1': [10, 20, 30],
        'Column2': [100, 200, 300],
        'Another_column': [25, 35, 45]}

df = pd.DataFrame(data)

# Accessing a single column
column_data = df['Column_name']
print("Single column:")
print(column_data)

# Accessing multiple columns
selected_columns = df[['Column1', 'Column2']]
print("\nMultiple columns:")
print(selected_columns)

# Accessing a specific row by integer position
row_data = df.iloc[0]
print("\nSpecific row:")
print(row_data)

# Accessing rows based on a condition
filtered_rows = df[df['Column_name'] > 10]
print("\nFiltered rows:")
print(filtered_rows)

# Accessing a single cell by label
value = df.at[0, 'Column_name']
print("\nSingle cell by label:")
print(value)

# Accessing a single cell by position
value = df.iat[0, 1]  # Row 0, Column 1
print("\nSingle cell by position:")
print(value)

# Accessing data using .loc
selected_data = df.loc[0, 'Column_name']
print("\nData using .loc:")
print(selected_data)

# Conditional access: filter rows and pick a column in one .loc call
selected_data = df.loc[df['Column_name'] > 10, 'Another_column']
print("\nConditional access:")
print(selected_data)
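
Conditions can also be combined. The sketch below assumes a hypothetical df_demo with the same columns as the sample above; each condition needs its own parentheses, and `&` means "and" while `|` means "or".

```python
import pandas as pd

# Hypothetical DataFrame mirroring the sample used above
df_demo = pd.DataFrame({'Column_name': [5, 15, 8],
                        'Column1': [10, 20, 30],
                        'Another_column': [25, 35, 45]})

# Build a Boolean mask from two conditions
mask = (df_demo['Column_name'] > 6) & (df_demo['Column1'] < 30)

# Select the matching rows and a single column in one step
result = df_demo.loc[mask, 'Another_column']
print(result)
```

Only the second row satisfies both conditions, so result holds the single value 35.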


### __1.3 Understanding DataFrame Basics__
- The head() and tail() methods enable users to efficiently preview the initial and final rows of a DataFrame, offering a quick snapshot of its structure and content. 
- These functions are invaluable for a preliminary assessment of column names, data types, and potential issues. Additionally, the info() method provides a comprehensive summary, detailing data types, non-null counts, and memory usage, aiding in the identification of missing or inconsistent data. 
- The shape attribute, on the other hand, succinctly communicates the dimensions of the DataFrame, encapsulating the number of rows and columns.
- The syntax for some functions is provided below:

![image.png](attachment:abb1b0c7-34f9-46a3-819c-12d3822c2d18.png)

In [None]:
df = pd.read_csv("IPL IMB381IPL2013.csv")
df

In [None]:
df.head()

In [None]:
df.tail() # last 5 entries of the data

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info() # information about the df

In [None]:
# selecting the required columns
df[['PLAYER NAME','COUNTRY','TEAM','PLAYING ROLE']]

In [None]:
# get the summary, how many players from each country

In [None]:
df.head()

In [None]:
# count the occurrences of each unique value in the COUNTRY column
df['COUNTRY'].value_counts()

In [None]:
#Statistical Analysis on the dataset
df.describe()

In [None]:
# Select the player name, sold price, and sort the dataframe based on sold price in descending order

In [None]:
df[['PLAYER NAME', 'SOLD PRICE']].sort_values(by='SOLD PRICE') # ascending order

In [None]:
df[['PLAYER NAME', 'SOLD PRICE']].sort_values(by='SOLD PRICE', ascending=False) # descending order

In [None]:
df[['PLAYER NAME', 'SOLD PRICE']].sort_values(by='SOLD PRICE', ascending=False).head()

In [None]:
df.head()

In [None]:
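# Which type of player (PLAYING ROLE) earns more on average?
# numeric_only=True skips text columns, which recent pandas versions refuse to average
df.groupby("PLAYING ROLE").mean(numeric_only=True)

In [None]:
# Select the column first so only SOLD PRICE is averaged
df.groupby("PLAYING ROLE")['SOLD PRICE'].mean()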

In [None]:
# Boolean Masking
df[df['SIXERS'] > 75]

In [None]:
# Removing the columns
df.head()

In [None]:
df2 = df.drop(columns=['Sl.NO.'])

In [None]:
df2.head()

In [None]:
import pandas as pd

# Create a sample DataFrame
data = {'Column_name': [5, 15, 8],
        'Column1': [10, 20, 30],
        'Column2': [100, 200, 300],
        'Another_column': [25, 35, 45]}

df = pd.DataFrame(data)

# Display the first 2 rows
print("First 2 rows:")
print(df.head(2))

# Display the last row
print("\nLast row:")
print(df.tail(1))

# Provide a comprehensive summary of the DataFrame
print("\nDataFrame summary:")
df.info()

# Return a tuple representing the dimensions of the DataFrame (Rows, columns)
print("\nDataFrame dimensions:")
print(df.shape)


## __2. Introduction to Statistical Operations in Pandas__
Pandas supports the computation of fundamental measures such as mean and median, along with the exploration of correlations and distribution characteristics. 

The following examples illustrate key statistical operations available in Pandas:

### __2.1 Descriptive Statistics__
Descriptive statistics offer a snapshot of the dataset's central tendencies and dispersions.

The describe() function provides a quick summary, including mean, standard deviation, and quartile information.

In [None]:
import pandas as pd

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Display descriptive statistics for numeric columns
print("Descriptive statistics for numeric columns:")
print(df.describe())
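
One detail worth knowing when reading the std row of describe(): pandas computes the sample standard deviation (ddof=1) by default. A short sketch, using a hypothetical one-column df_demo, compares it with the population version:

```python
import pandas as pd

# Hypothetical single-column DataFrame reusing values from the sample above
df_demo = pd.DataFrame({'Numeric_column2': [10, 20, 30]})

summary = df_demo.describe()

# Default: sample standard deviation (divides by n - 1)
sample_std = df_demo['Numeric_column2'].std()

# Population standard deviation (divides by n)
population_std = df_demo['Numeric_column2'].std(ddof=0)

print(summary.loc['std', 'Numeric_column2'])  # matches sample_std
print(sample_std, population_std)
```

For these three values the sample standard deviation is 10.0, while the population version is smaller (about 8.165).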


### __2.2 Mean, Median, and Standard Deviation__

In [None]:
import pandas as pd

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Calculate mean, median, and standard deviation
mean_value = df.mean()
median_value = df.median()
std_deviation = df.std()

print("Mean:\n", mean_value)
print("\nMedian:\n", median_value)
print("\nStandard deviation:\n", std_deviation)


### __2.3 Correlation Analysis__
The corr() function generates a correlation matrix, indicating how variables relate to each other.

Values closer to 1 or -1 imply a stronger correlation, while values near 0 suggest a weaker correlation.

In [None]:
import pandas as pd

# Create a sample DataFrame with numeric columns
data = {'Numeric_column1': [5, 15, 8],
        'Numeric_column2': [10, 20, 30],
        'Numeric_column3': [100, 200, 300]}

df = pd.DataFrame(data)

# Compute correlation matrix
correlation_matrix = df.corr()

print("Correlation matrix:\n", correlation_matrix)
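
By default, corr() computes the Pearson correlation coefficient. The following sketch recomputes it by hand to show what the matrix entries mean; the helper name pearson and the df_demo variable are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with the same values as the sample above
df_demo = pd.DataFrame({'Numeric_column1': [5, 15, 8],
                        'Numeric_column2': [10, 20, 30],
                        'Numeric_column3': [100, 200, 300]})

def pearson(x, y):
    """Pearson correlation: covariance scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

manual = pearson(df_demo['Numeric_column1'], df_demo['Numeric_column2'])
from_pandas = df_demo['Numeric_column1'].corr(df_demo['Numeric_column2'])

print(manual, from_pandas)
print(df_demo.corr().loc['Numeric_column2', 'Numeric_column3'])
```

Numeric_column3 is exactly ten times Numeric_column2, so their correlation is 1: a perfect linear relationship.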


#### __Value Counts__
The value_counts() function tallies the occurrences of unique values in a categorical column, aiding in understanding the distribution of categorical data.

In [None]:
import pandas as pd

# Create a sample DataFrame with a category column
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C']}
df = pd.DataFrame(data)

# Count occurrences of unique values in the category column
value_counts = df['Category'].value_counts()

print("Value counts:\n", value_counts)
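
When proportions are more useful than raw counts, value_counts() accepts normalize=True. A brief sketch on the same category values (df_demo is a hypothetical name):

```python
import pandas as pd

# Same category values as the sample above
df_demo = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C']})

# Raw counts per category
counts = df_demo['Category'].value_counts()

# Share of each category (the values sum to 1)
shares = df_demo['Category'].value_counts(normalize=True)

print(counts)
print(shares)
```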


# __Assisted Practice__

## __Problem Statement:__
Analyze a housing dataset using Pandas DataFrame and statistical operations to understand the basic characteristics of the data and the relationships between different variables.

## __Steps to Perform:__
- Load the housing dataset into a Pandas DataFrame
- Familiarize yourself with the DataFrame basics such as its structure, data types of the columns, and summary statistics
- Calculate descriptive statistics like mean, median, and standard deviation for numerical columns
- Count the number of occurrences of each category in categorical variables such as __city__, __condition__
- Compute the average price per city (using groupby) and sort it in descending order
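
The last step can be sketched on a small stand-in dataset before running it on the real file. The city names and prices below are hypothetical and are not taken from HousePrices.csv; only the column names city, condition, and price follow the steps above.

```python
import pandas as pd

# Hypothetical stand-in for the housing data
houses = pd.DataFrame({'city': ['Seattle', 'Seattle', 'Kent', 'Kent', 'Renton'],
                       'condition': [3, 4, 3, 3, 5],
                       'price': [500000, 700000, 300000, 350000, 400000]})

# Select the price column first, average it per city, then sort high to low
avg_price = (houses.groupby('city')['price']
                   .mean()
                   .sort_values(ascending=False))
print(avg_price)
```

Selecting the column before aggregating keeps text columns out of the calculation and avoids averaging columns that are not needed.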

In [None]:
df = pd.read_csv("HousePrices.csv")
df.info()

In [None]:
df.describe()

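In [None]:
# numeric_only=True restricts the calculation to numeric columns
df.mean(numeric_only=True)

In [None]:
df.median(numeric_only=True)

In [None]:
df.std(numeric_only=True)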

In [None]:
df['city'].value_counts()

In [None]:
df['condition'].value_counts()

In [None]:
# Average price per city, sorted in descending order
df.groupby('city')['price'].mean().sort_values(ascending=False)

In [None]:
# Correlation between numeric IPL columns (df2 is the IPL DataFrame created in section 1.3)
df2[['T-RUNS','T-WKTS','BASE PRICE','SOLD PRICE']].corr()

In [None]:
df.groupby('city')['price'].mean()

In [None]:
df.to_csv('final_processed.csv')

In [None]:
df.to_