In [1]:
import pandas as pd
import numpy as np
from figure_labeler import *

from IPython.display import HTML
HTML('''
<script
    src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js">
</script>
<script>
code_show = true;
function code_toggle() {
    if (code_show) {
        $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
    } else {
        $('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
    }
    code_show = !code_show;
}
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
''')

In [2]:
fl = FigureLabeler()
# Silence pandas' SettingWithCopyWarning for chained assignment.
pd.options.mode.chained_assignment = None

<h1 style="text-align:center;">Using Pandas and SQL for EDA (Pandas Version)</h1>
<hr>

<a name="top"></a>
#### Table of Contents:

[ref0]: #exec_summary
- [Executive Summary][ref0]

[ref2]: #motiv
- [Motivation][ref2]

[ref3]: #dat_sor
- [Data Source][ref3]

[ref4]: #dat_prep
- [Importing, Preprocessing, and EDA][ref4]

[ref6]: #res_dis
- [Conclusion][ref6]

***

<a name="exec_summary"></a>
## Executive Summary
***

This project demonstrates how Pandas and SQL can be used together for exploratory data analysis (EDA), using two separate Jupyter notebooks: one focused on Pandas and one on SQL. Working through the same dataset in both shows the strengths of each tool in data manipulation and querying, and how the two can be combined to explore a dataset's characteristics and patterns. This notebook is the Pandas version.

[ref]: #top
[Back to Table of Contents][ref]

<a name="motiv"></a>
## Motivation
***

Exploratory data analysis (EDA) workflows often rely on two widely used tools, Pandas and SQL. We aim to show how similar the two are for common EDA tasks and how they complement each other: Pandas excels at data manipulation and analysis within Python environments, while SQL offers robust querying capabilities suited to handling large datasets efficiently.

[ref]: #top
[Back to Table of Contents][ref]

<a name="dat_sor"></a>
## Data Source
***

The dataset is the [Student Study Performance Dataset](https://www.kaggle.com/datasets/bhavikjikadara/student-study-performance) from Kaggle.


[ref]: #top
[Back to Table of Contents][ref]

<a name="dat_prep"></a>
## Importing, Preprocessing, and EDA
***

In this section we use Pandas to import, preprocess, and explore the dataset. The read_csv() function loads the data into a DataFrame. From there, .head() previews the first rows, .dtypes shows the data type of each column, and .describe() produces summary statistics for the numerical columns. We use .value_counts() to see how the categorical variables are distributed, .isnull().sum() to count missing values per column, and .nunique() to count the distinct values in each column. Together, these checks establish the structure and quality of the data before any further analysis.
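
The steps described above can be tried without the CSV file. The following is a minimal sketch, assuming only the five rows shown by perf.head() in this notebook; the names `SAMPLE_CSV` and `sample` are introduced here for illustration and are not part of the original analysis.

```python
import io

import pandas as pd

# The five rows shown by perf.head() in this notebook, inlined so the
# sketch runs without study_performance.csv (assumption: illustration only).
SAMPLE_CSV = """gender,race_ethnicity,parental_level_of_education,lunch,test_preparation_course,math_score,reading_score,writing_score
female,group B,bachelor's degree,standard,none,72,72,74
female,group C,some college,standard,completed,69,90,88
female,group B,master's degree,standard,none,90,95,93
male,group A,associate's degree,free/reduced,none,47,57,44
male,group C,some college,standard,none,76,78,75
"""

# read_csv() accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

preview = sample.head()                          # first rows
column_types = sample.dtypes                     # data type per column
summary = sample.describe()                      # statistics for numeric columns
gender_counts = sample["gender"].value_counts()  # rows per category
missing_per_column = sample.isnull().sum()       # null count per column
unique_per_column = sample.nunique()             # distinct values per column

print(summary)
print(gender_counts)
```

Because .describe() only summarizes numerical columns by default, `summary` contains just the three score columns, which matches the output of the full dataset later in this notebook.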

[ref]: #top
[Back to Table of Contents][ref]

In [3]:
perf = pd.read_csv('study_performance.csv')

In [4]:
fl.table_caption("Head of Study Performance of Students",
                 "The first 5 rows of the Data on Study Performance of Students")
perf.head()

| | gender | race_ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [5]:
perf.dtypes

gender                         object
race_ethnicity                 object
parental_level_of_education    object
lunch                          object
test_preparation_course        object
math_score                      int64
reading_score                   int64
writing_score                   int64
dtype: object

Data types of the columns in the dataset.

In [6]:
fl.table_caption("Summary Statistics of Study Performance",
                 "Summary statistics of the numerical variables, produced by .describe()")
perf.describe()

| | math_score | reading_score | writing_score |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 66.089 | 69.169 | 68.054 |
| std | 15.16308 | 14.600192 | 15.195657 |
| min | 0.0 | 17.0 | 10.0 |
| 25% | 57.0 | 59.0 | 57.75 |
| 50% | 66.0 | 70.0 | 69.0 |
| 75% | 77.0 | 79.0 | 79.0 |
| max | 100.0 | 100.0 | 100.0 |


In [7]:
perf['gender'].value_counts()

gender
female    518
male      482
Name: count, dtype: int64

Number of students by gender.

In [8]:
perf['race_ethnicity'].value_counts()

race_ethnicity
group C    319
group D    262
group B    190
group E    140
group A     89
Name: count, dtype: int64

Number of students in each race/ethnicity group.
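
Since this project compares Pandas with SQL, it may help to see the same count expressed both ways. The sketch below is an illustration only: it assumes an in-memory SQLite database and a hypothetical table name `perf`, and it uses just the race_ethnicity values from the five rows shown by perf.head(). It is not taken from the companion SQL notebook.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# race_ethnicity values from the five rows shown by perf.head().
sample = pd.DataFrame(
    {"race_ethnicity": ["group B", "group C", "group B", "group A", "group C"]}
)

# Pandas: count rows per category.
pandas_counts = sample["race_ethnicity"].value_counts()

# SQL: the same count written as GROUP BY, run on an in-memory SQLite
# database (assumption: table name "perf" is chosen for this sketch).
with closing(sqlite3.connect(":memory:")) as conn:
    sample.to_sql("perf", conn, index=False)
    sql_counts = pd.read_sql_query(
        """
        SELECT race_ethnicity, COUNT(*) AS n
        FROM perf
        GROUP BY race_ethnicity
        ORDER BY n DESC
        """,
        conn,
    )

# Convert both results to plain dicts so they can be compared directly.
pandas_as_dict = {k: int(v) for k, v in pandas_counts.items()}
sql_as_dict = {k: int(v) for k, v in zip(sql_counts["race_ethnicity"], sql_counts["n"])}

print(pandas_as_dict == sql_as_dict)
```

Both approaches group the rows by category and count them; .value_counts() also sorts by count in descending order, which the SQL version reproduces with ORDER BY.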

In [9]:
perf.isnull().sum()

gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

No column contains missing (null) values.
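
This dataset has no missing values, so the output above is all zeros. To show what .isnull().sum() reports when values are missing, here is a small sketch on a hypothetical DataFrame; the name `toy` and its values are invented for illustration and are not part of the study performance data.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with gaps; NOT taken from the study performance data.
toy = pd.DataFrame(
    {
        "math_score": [72, np.nan, 90],
        "reading_score": [72, 90, np.nan],
        "writing_score": [74, 88, 93],
    }
)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()

# A single yes/no answer for the whole DataFrame.
has_missing = bool(toy.isnull().values.any())

# One possible follow-up: keep only the rows with no missing values.
complete_rows = toy.dropna()

print(missing_per_column)
```

Whether to drop, fill, or keep rows with missing values depends on the analysis; dropping them is shown here only as one option.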

In [10]:
perf.nunique()

gender                          2
race_ethnicity                  5
parental_level_of_education     6
lunch                           2
test_preparation_course         2
math_score                     81
reading_score                  72
writing_score                  77
dtype: int64

Number of unique values in each column of the dataset.

<a name="res_dis"></a>
## Conclusion
***

In conclusion, this notebook used Pandas to import, preprocess, and explore the student study performance dataset, covering each step from loading the data to summarizing statistics. The checks on data types, missing values, and unique value counts show 1,000 records with five categorical columns and three integer score columns, and no missing values, which confirms the data's integrity and gives a solid groundwork for further analysis. This EDA provides a first understanding of the dataset's characteristics and demonstrates Pandas as a fundamental tool for data exploration that helps data practitioners derive insights and make informed decisions.



[ref]: #top
[Back to Table of Contents][ref]