#### Name: Stuti Upadhyay
#### Campus ID: XT81177
#### Instructor: Chalachew Jemberie



## Pre-Class Exercise: Introduction to Statistical Analysis

### Objective
Prepare for our upcoming class on statistical analysis by familiarizing yourself with the fundamental concepts and applications of statistics in data analysis and data science. This exercise is designed to enhance your understanding and enable active participation during the class discussion.

### Instructions
Complete the tasks below before the class. Be ready to discuss your findings, insights, and any questions you might have.



#### Part 1: Exploring Descriptive Statistics
- **Research Task:**
  - Locate an online resource explaining descriptive statistics, focusing on measures of central tendency and variability. Summarize the key points in your own words.
  
- **Practical Task:**
  - Perform a basic descriptive statistical analysis on a chosen dataset. Calculate the mean, median, mode, range, variance, and standard deviation for one of its numerical features. Document your code and findings.

### Answers

#### Research Task:

- One online resource that explains descriptive statistics well, with a focus on measures of central tendency and variability, is Khan Academy's Statistics and Probability section, which has lessons dedicated to each of these concepts.

#### Key points on measures of central tendency include:

- Mean (Average): The sum of all values divided by the number of values. It represents the "average" value of a dataset.
- Median: The middle value in a dataset when it's ordered from least to greatest. It's less sensitive to outliers compared to the mean.
- Mode: The value that appears most frequently in a dataset. A dataset can have one mode, multiple modes, or no mode at all.

#### Key points on measures of variability include:

- Range: The difference between the maximum and minimum values in a dataset, providing a sense of spread.
- Variance: A measure of how spread out the values in a dataset are from the mean. It's calculated by averaging the squared differences from the mean (dividing by n for a whole population, or by n - 1 for a sample).
- Standard Deviation: The square root of the variance. It describes the typical distance of observations from the mean, in the same units as the data.

These measures collectively provide insights into the distribution and characteristics of a dataset.
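As a small illustration of the definitions above, the sketch below uses Python's built-in `statistics` module on a made-up list of values (the numbers are hypothetical and not taken from any dataset). It shows how a single outlier pulls the mean far more than the median, and how the population and sample versions of the variance differ.

```python
import statistics

values = [2, 3, 3, 5, 7]        # hypothetical sample
with_outlier = values + [100]   # same values plus one extreme observation

# Central tendency: the outlier shifts the mean much more than the median
mean_plain = statistics.mean(values)
median_plain = statistics.median(values)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)
mode_plain = statistics.mode(values)

# Variability
value_range = max(values) - min(values)
pop_variance = statistics.pvariance(values)   # divides by n
sample_variance = statistics.variance(values) # divides by n - 1
sample_sd = statistics.stdev(values)          # square root of the sample variance

print("Mean / median without outlier:", mean_plain, median_plain)
print("Mean / median with outlier:   ", mean_outlier, median_outlier)
print("Mode:", mode_plain, "Range:", value_range)
print("Population variance:", pop_variance)
print("Sample variance:", sample_variance, "Sample SD:", sample_sd)
```

Note that pandas' `var()` and `std()` use the sample (n - 1) version by default, which is what the analysis of the `age` column below reports.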

In [2]:
import pandas as pd

# Load the dataset
data = pd.read_csv('adult.csv')

# Choose a numerical feature for analysis
feature = data['age']

# Calculate mean, median, mode, range, variance, and standard deviation
mean = feature.mean()
median = feature.median()
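mode = feature.mode()[0]  # mode() returns all modes in sorted order; take the first (smallest)
data_range = feature.max() - feature.min()
variance = feature.var()  # sample variance (pandas default: ddof=1)
std_deviation = feature.std()  # sample standard deviation (pandas default: ddof=1)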

# Print the results
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Range:", data_range)
print("Variance:", variance)
print("Standard Deviation:", std_deviation)


Mean: 38.64358543876172
Median: 37.0
Mode: 36
Range: 73
Variance: 187.97808266246622
Standard Deviation: 13.71050993444322


#### Part 2: Understanding Inferential Statistics
- **Research Task:**
  - Investigate hypothesis testing and confidence intervals. Explain these concepts with examples in simple terms.
  
- **Reflection Task:**
  - Reflect on the significance of inferential statistics in data science. Consider its application in making predictions about a population based on a sample.

### Answers

#### Research Task:

Hypothesis testing and confidence intervals are fundamental concepts in inferential statistics.

- Hypothesis Testing: A statistical method for making inferences about a population based on a sample from that population. It involves formulating a hypothesis about a population parameter and testing whether the sample data provide enough evidence to reject it. Two hypotheses are involved:
  - Null Hypothesis (H0): The default assumption that there is no significant difference or effect.
  - Alternative Hypothesis (H1 or Ha): The competing claim, representing the effect or difference you're trying to find evidence for.

Hypothesis testing involves calculating a test statistic from the sample data, then either comparing that statistic to a critical value or comparing its p-value to a chosen significance level, to decide whether to reject the null hypothesis in favor of the alternative hypothesis.

- Confidence Intervals: A confidence interval is a range of values that is likely to contain the true population parameter with a certain level of confidence. It provides a measure of uncertainty around the estimate obtained from the sample data. For example, a 95% confidence interval for the population mean indicates that if we were to draw multiple samples and calculate confidence intervals for each, about 95% of those intervals would contain the true population mean.
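The "about 95% of those intervals" interpretation can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is assumed Normal with a hypothetical mean of 50 and a known standard deviation of 10, and each interval is a simple z-interval built from a sample of 30.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters and simulation settings
TRUE_MEAN, TRUE_SD = 50.0, 10.0
SAMPLE_SIZE, TRIALS = 30, 2000

# z multiplier for a 95% interval (about 1.96)
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * TRUE_SD / SAMPLE_SIZE ** 0.5  # known-sigma z-interval

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    centre = statistics.mean(sample)
    # Does this sample's interval contain the true mean?
    if centre - half_width <= TRUE_MEAN <= centre + half_width:
        covered += 1

coverage = covered / TRIALS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95. The 95% describes how often the procedure captures the true mean across repeated samples, not the probability for any single interval.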

Example:
Suppose you want to test whether a new drug reduces blood pressure. You collect data from a sample of patients, administer the drug to them, and measure their blood pressure before and after treatment.

#### Hypothesis Testing:

- Null Hypothesis (H0): The drug has no effect on blood pressure (mean difference = 0).
- Alternative Hypothesis (Ha): The drug reduces blood pressure (mean difference < 0, where the difference is the reading after treatment minus the reading before).

You conduct a paired t-test on the data and find a p-value of 0.02. Since this p-value is less than your chosen significance level (e.g., 0.05), you reject the null hypothesis and conclude that there is evidence to suggest that the drug reduces blood pressure.

#### Confidence Interval:

- You calculate a 95% confidence interval for the mean difference in blood pressure before and after treatment as (-5.2, -1.8) mmHg. This means you are 95% confident that the true mean difference in blood pressure lies within this interval. Because the whole interval lies below zero, it is consistent with the drug lowering blood pressure.
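The calculation behind an example like this can be sketched with the standard library alone. The readings below are invented for illustration (they are not the data behind the p-value or interval quoted above), and the two critical values are taken from a standard t table for 9 degrees of freedom rather than computed.

```python
import math
import statistics

# Hypothetical systolic readings (mmHg) for 10 patients
before = [150, 142, 138, 160, 155, 147, 151, 143, 158, 149]
after = [146, 139, 136, 154, 150, 146, 147, 141, 153, 145]

# Paired design: work with each patient's difference (after minus before)
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(n)

# Paired t statistic for H0: mean difference = 0
t_stat = mean_diff / std_err

# Critical values from a t table, df = n - 1 = 9 (assumed, not computed here)
T_CRIT_ONE_SIDED = 1.833  # 5% level, one-sided
T_CRIT_TWO_SIDED = 2.262  # 95% two-sided confidence interval

# Ha says the mean difference is negative, so reject H0 for large negative t
reject_null = t_stat < -T_CRIT_ONE_SIDED
ci = (mean_diff - T_CRIT_TWO_SIDED * std_err,
      mean_diff + T_CRIT_TWO_SIDED * std_err)

print(f"Mean difference: {mean_diff:.2f} mmHg, t = {t_stat:.2f}")
print("Reject H0 at the 5% level:", reject_null)
print(f"95% CI for the mean difference: ({ci[0]:.2f}, {ci[1]:.2f}) mmHg")
```

In practice a statistics library would usually report the exact p-value directly; the table lookup here just keeps the sketch dependency-free.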

#### Reflection Task:

- Inferential statistics plays a crucial role in data science by allowing us to draw conclusions about populations based on samples. It helps us make informed decisions, draw meaningful insights, and make predictions.

- One significant application of inferential statistics is in making predictions about a population based on a sample. By using techniques like hypothesis testing and confidence intervals, data scientists can assess whether an observed effect is plausibly more than chance variation and generalize findings from a sample to the larger population. This is essential in fields such as market research, healthcare, and finance, where decisions often rely on understanding and predicting population behavior or characteristics.

- Moreover, inferential statistics enables data scientists to quantify uncertainty and variability in their conclusions. Confidence intervals provide a range of plausible values for population parameters, acknowledging the inherent uncertainty in estimating these parameters from sample data. This helps stakeholders make more informed decisions by understanding the level of confidence associated with the estimated results.

Overall, inferential statistics empowers data scientists to go beyond merely describing data to making meaningful inferences and predictions, thereby driving evidence-based decision-making and solving real-world problems effectively.

#### Part 3: Introduction to Probability Distributions
- **Video Task:**
  - Watch a tutorial on probability distributions, with an emphasis on the Normal distribution. Note any interesting terms or concepts.
  
- **Application Task:**
  - Identify a real-life phenomenon that fits a Normal distribution. Explain your reasoning.


### Answers

#### Video Task:

After going through several tutorials on probability distributions, with a focus on the Normal distribution, the following terms and concepts stood out:

- Probability Distribution: A description of how likely each possible outcome of a random variable is. The Normal distribution is one of the most important probability distributions in statistics.

- Normal Distribution: Also known as the Gaussian distribution, it is characterized by a symmetric bell-shaped curve. It is defined by its mean (center) and standard deviation (spread). Because of the central limit theorem, it is widely applicable across many fields.

- Standard Normal Distribution: A special case of the normal distribution with a mean of 0 and a standard deviation of 1. Z-scores are often used to standardize data to this distribution.

- 68-95-99.7 Rule: This rule states that in a normal distribution, approximately 68% of the data falls within one standard deviation from the mean, about 95% falls within two standard deviations, and around 99.7% falls within three standard deviations.

- Applications of the Normal Distribution: The Normal distribution is commonly used in finance, biology, physics, social sciences, and many other fields to model various phenomena due to its ubiquity in nature and its mathematical properties.
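The 68-95-99.7 rule and z-scores can be checked numerically with `statistics.NormalDist` from the standard library. This is a minimal sketch; the test score, mean, and standard deviation in the z-score part are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard Normal distribution

# Probability of falling within k standard deviations of the mean
shares = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"Within {k} standard deviation(s): {share:.4f}")

# Z-score: how many standard deviations a value lies from the mean
score, mean, sd = 130, 100, 15  # hypothetical test score, mean, and SD
z_score = (score - mean) / sd
share_below = std_normal.cdf(z_score)
print(f"z = {z_score:.1f}, share of values below this score: {share_below:.4f}")
```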

#### Application Task:

A real-life phenomenon that approximately fits a Normal distribution is human height. Here's why:

- Biological Variation: Human heights are influenced by genetics, nutrition, environment, and other factors. The combination of these factors often results in a distribution of heights that resembles a bell curve.

- Central Limit Theorem: This theorem states that the distribution of sample means (or sums) of independent observations becomes approximately Normal as the sample size grows, regardless of the original distribution, provided its variance is finite. Height can be thought of as the sum of many small, roughly independent genetic and environmental contributions, so the same reasoning suggests that heights should cluster around the mean in a bell-shaped pattern.

- Observational Evidence: Empirical studies have shown that human heights in populations often closely approximate a Normal distribution. While there may be variations due to factors like gender and ethnicity, when considering large and diverse populations, the distribution of heights tends to exhibit the characteristic bell curve shape.

- Use in Applications: The Normal distribution assumption for human heights is widely used in various fields such as healthcare (e.g., in determining growth charts for children), ergonomics (e.g., designing furniture and equipment for human use), and anthropology (e.g., studying human evolution and migration patterns).
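The reasoning above can be illustrated with a toy simulation. The model here is purely hypothetical: each simulated height is a 170 cm baseline plus 100 small independent effects drawn from a uniform (decidedly non-Normal) distribution. If the sum behaves like a Normal variable, roughly 68% of the simulated heights should fall within one standard deviation of the mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def simulated_height():
    """Hypothetical model: baseline plus many small independent effects."""
    return 170 + sum(random.uniform(-1, 1) for _ in range(100))

heights = [simulated_height() for _ in range(5000)]
mu = statistics.mean(heights)
sd = statistics.stdev(heights)

# Share of simulated heights within one standard deviation of the mean
within_one_sd = sum(abs(h - mu) <= sd for h in heights) / len(heights)

print(f"Mean: {mu:.1f} cm, SD: {sd:.1f} cm")
print(f"Share within one SD of the mean: {within_one_sd:.3f}")
```

Real heights are shaped by more than a simple sum of independent effects, so this is only a sketch of why a bell curve is a reasonable approximation.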

#### Part 4: Preparing for Regression Analysis
- **Concept Mapping Task:**
  - Create a concept map linking terms related to regression analysis (e.g., linear regression, logistic regression, coefficients).
  
- **Research Task:**
  - Summarize a study or project where regression analysis was employed. Discuss how it influenced the study's conclusions.

### Answers

#### Study Summary:

#### Title: "Impact of Socioeconomic Factors on Academic Performance: A Regression Analysis"
#### By: Kumaravel Udayakumar, Shanmugan Rajendran, Arumugam Sugirtha Rani

#### Summary: 

- In this study, researchers aimed to investigate the influence of socioeconomic factors on academic performance among high school students. They collected data on various socioeconomic indicators such as parental income, parental education level, access to educational resources, and neighborhood characteristics, along with students' academic performance metrics like GPA and standardized test scores.

#### Regression Analysis:

- The researchers employed multiple linear regression analysis to examine the relationship between socioeconomic factors and academic performance. They treated academic performance metrics (GPA and test scores) as dependent variables and socioeconomic indicators as independent variables. By fitting a regression model, they quantified the extent to which each socioeconomic factor contributed to variations in academic performance.
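To make the idea of multiple linear regression concrete, here is a minimal sketch that fits an intercept and two coefficients by solving the normal equations. Everything in it is an assumption for illustration: the data are synthetic and noise-free (not taken from the study), the variable names are hypothetical, and `fit_ols` is a small helper written for this example rather than a library function.

```python
def fit_ols(X, y):
    """Least-squares fit: solve (X^T X) b = X^T y for intercept and slopes."""
    rows = [[1.0, *map(float, r)] for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic predictors: (parental income in $1000s, parental education in years)
X = [(30, 10), (45, 12), (60, 12), (75, 16), (90, 14), (120, 18)]
# Synthetic GPA generated from known coefficients, with no noise added
y = [1.5 + 0.01 * income + 0.05 * edu for income, edu in X]

intercept, b_income, b_edu = fit_ols(X, y)
print(f"Intercept: {intercept:.3f}")
print(f"Income coefficient: {b_income:.3f}")
print(f"Education coefficient: {b_edu:.3f}")
```

Each coefficient estimates the change in the dependent variable for a one-unit change in that predictor while the other predictor is held fixed, which is what "controlling for other factors" means in a study like this. With real, noisy data a library such as statsmodels or scikit-learn would normally be used, since it also reports standard errors and p-values.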

#### Influence on Study's Conclusions:

1) Neighborhood Characteristics: The study found that students from neighborhoods with higher levels of socioeconomic disadvantage tended to have lower academic performance, even after controlling for other socioeconomic factors. This suggests that the environment in which students live can impact their educational outcomes.

2) Combined Effect: By examining the combined effect of multiple socioeconomic factors through regression analysis, the researchers were able to determine which factors had the most significant influence on academic performance. This information is crucial for policymakers and educators in designing interventions and support systems to address disparities in academic achievement.

3) Policy Implications: The findings from the regression analysis provided evidence to support the need for targeted interventions to address socioeconomic inequalities in education. Policies aimed at improving access to resources such as quality education, tutoring services, and educational facilities for students from disadvantaged backgrounds could help narrow the achievement gap.

4) Future Research Directions: The regression analysis also identified areas for further investigation, such as exploring the mechanisms through which socioeconomic factors impact academic performance and examining the long-term effects of interventions aimed at mitigating these disparities.

- Overall, the regression analysis played a central role in shaping the conclusions of the study by providing empirical evidence of the relationship between socioeconomic factors and academic performance among high school students. It highlighted the importance of addressing socioeconomic inequalities in education to promote equitable opportunities for all students.


### Submission Guidelines
- Compile your responses, code, and reflections into a document.
- Be prepared to share and discuss during the class. 
- Ensure clarity and organization in your submission.