

### Handling Missing Data in the Dataset

#### Step 1: Loading the Dataset
We'll start by importing the dataset using pandas and inspecting the first few rows to understand its structure.




In [None]:
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
# Load the dataset
df = pd.read_csv('FIFA2020.csv', encoding="ISO-8859-1")

# Display the first few rows of the dataframe
df.head()




The dataset contains various columns, including player attributes like overall score, nationality, club, and specific skills. Now, let's focus on the 'pace' and 'dribbling' columns to address the missing data.

#### Step 2: Handling Missing Data in 'Pace' and 'Dribbling'
For quantitative data like 'pace' and 'dribbling', common approaches to handle missing values include:
1. **Removing rows with missing values:** This is straightforward but can lead to loss of valuable data.
2. **Replacing with mean/median/mode:** This method preserves the dataset size but shrinks the column's variance and can bias later estimates.
3. **Predictive imputation:** Using other features to predict missing values, which can be complex.
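To make the trade-off between the first two approaches concrete, here is a minimal sketch on a hypothetical six-row column (the values are invented for illustration and are not from the FIFA dataset). It shows that dropping rows loses data, while mean imputation keeps every row but reduces the standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({"pace": [60.0, 70.0, np.nan, 80.0, np.nan, 90.0]})

# Approach 1: drop rows with a missing value (loses two of the six rows).
dropped = toy.dropna(subset=["pace"])

# Approach 2: fill missing values with the column mean (NaN is skipped by mean()).
pace_mean = toy["pace"].mean()
filled = toy["pace"].fillna(pace_mean)

# Mean imputation keeps the mean but shrinks the spread, because every
# imputed value sits exactly at the centre of the observed values.
print(len(dropped), pace_mean)
print(dropped["pace"].std(), filled.std())
```

The second printed line shows the smaller standard deviation after imputation, which is the kind of distortion to keep in mind when the share of missing values is large.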




In [None]:
# Check for missing values in 'pace' and 'dribbling' columns
missing_values = df[['pace', 'dribbling']].isna().sum()

# Calculate the mean for these columns (excluding NaN values)
mean_pace = df['pace'].mean()
mean_dribbling = df['dribbling'].mean()

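# Replace NaN values with the calculated means
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
df['pace'] = df['pace'].fillna(mean_pace)
df['dribbling'] = df['dribbling'].fillna(mean_dribbling)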

# Updated missing values count after imputation
updated_missing_values = df[['pace', 'dribbling']].isna().sum()

missing_values, mean_pace, mean_dribbling, updated_missing_values



#### Results from Missing Data Handling:
- Initially, both 'pace' and 'dribbling' columns had 2182 missing values each.
- The mean values calculated for these columns were approximately 69.90 for 'pace' and 65.02 for 'dribbling'.
- These mean values were used to replace the missing data in the respective columns.
- After replacement, there are no missing values in 'pace' and 'dribbling' columns.


### Part 1: Boxplot for "Age" and Calculation of Statistical Values

#### Step 1: Plotting a Boxplot for "Age"
We will create a box plot for the 'age' column to visualize its distribution. This plot will help us understand the spread and skewness of the data.

#### Step 2: Calculating Statistical Values
We will calculate the minimum (min), first quartile (Q1), median (Q2), third quartile (Q3), and maximum (max) values for the 'age' column. Each of these values has a specific meaning:
- **Min:** The lowest value.
- **Q1:** The median of the lower half of the data (25th percentile).
- **Q2/Median:** The middle value of the dataset.
- **Q3:** The median of the upper half of the data (75th percentile).
- **Max:** The highest value.
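As a small worked example of these five values, the sketch below uses a hypothetical array of nine numbers. It also shows that the "median of the lower half" description and the linear-interpolation rule used by default in NumPy's `percentile` (and in pandas' `quantile`) can give slightly different quartiles on small samples.

```python
import numpy as np

# Hypothetical values, chosen so the arithmetic is easy to follow.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Five-number summary using NumPy's default (linear interpolation) percentiles.
summary = {
    "min": values.min(),
    "q1": np.percentile(values, 25),
    "median": np.median(values),
    "q3": np.percentile(values, 75),
    "max": values.max(),
}

# "Median of the lower half" convention, excluding the overall median.
lower_half = values[values < np.median(values)]
q1_halves = np.median(lower_half)

print(summary)
print(q1_halves)
```

On large datasets such as this one, the two conventions usually agree closely.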

Let's start by creating the box plot and calculating these values.


In [None]:
# Create a box plot for the 'age' column
plt.figure(figsize=(8, 6))
plt.boxplot(df['age'])
plt.title('Box Plot of Player Ages')
plt.ylabel('Age')
plt.show()

# Calculate the statistical values for 'age'
age_min = df['age'].min()
age_q1 = df['age'].quantile(0.25)
age_median = df['age'].median()
age_q3 = df['age'].quantile(0.75)
age_max = df['age'].max()

age_min, age_q1, age_median, age_q3, age_max




#### Results:
- **Boxplot for 'Age':** The boxplot visually represents the distribution of player ages in the dataset.
- **Statistical Values:**
  - **Minimum (Min):** 17 years (The youngest player's age)
  - **First Quartile (Q1):** 23 years (about 25% of players are 23 or younger)
  - **Median (Q2):** 26 years (about half of the players are 26 or younger)
  - **Third Quartile (Q3):** 30 years (about 75% of players are 30 or younger)
  - **Maximum (Max):** 88 years (The oldest player's age)
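
The reported quartiles also determine where the box plot's whiskers can end. The sketch below applies the 1.5 × IQR rule, which is matplotlib's default for `boxplot`, to the values listed above; any age beyond a fence is drawn as an individual outlier point.

```python
# Quartiles and extremes as reported in the results above.
q1, q3 = 23, 30
age_min, age_max = 17, 88

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # points below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr    # points above this are drawn as outliers

print(iqr, lower_fence, upper_fence)
print(age_min < lower_fence, age_max > upper_fence)
```

With these numbers the upper fence is 40.5, so the maximum of 88 lies well outside it, while the minimum of 17 lies inside the lower fence of 12.5.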

### Analysis of "Height" Samples

#### Sampling from "Height"
We randomly select 100 values from the 'height' column without replacement and calculate their mean, variance, and standard deviation. The seeded sampling cell that produces these values appears later in this notebook.

- **Mean Height:** 180.54 cm (The average height of the sampled players)
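- **Variance:** 37.46 cm² (The average squared deviation of the sampled heights from their mean, computed with the n − 1 denominator that pandas uses by default)
- **Standard Deviation:** 6.12 cm (The square root of the variance, a measure of the typical distance of a height from the mean)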
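
The relationship between these three statistics can be sketched with Python's standard library. The five heights below are hypothetical and serve only to make the arithmetic visible; the last line checks that the reported standard deviation is the square root of the reported variance.

```python
import math
import statistics

# Hypothetical heights in cm, not taken from the dataset.
heights = [170, 175, 180, 185, 190]

mean_h = statistics.mean(heights)
# statistics.variance divides by n - 1 (sample variance), which matches the
# default of pandas' Series.var() and Series.std().
var_h = statistics.variance(heights)
std_h = statistics.stdev(heights)
# The population variance divides by n instead and is therefore smaller.
pvar_h = statistics.pvariance(heights)

print(mean_h, var_h, std_h, pvar_h)

# Consistency check on the values reported above: sqrt(37.46) is about 6.12.
print(round(math.sqrt(37.46), 2))
```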

### Explanation of QQ Plot

A QQ (Quantile-Quantile) plot is a graphical tool that compares the quantiles of two distributions to assess whether they have the same shape. By plotting the quantiles of the data against those of a theoretical distribution, such as the normal distribution, we can visually inspect whether the data conforms to it. If the points fall close to a straight reference line (the line y = x when both sets of quantiles are on the same scale), the distributions agree; systematic deviations from the line indicate deviations from the expected distribution.
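
To show what a QQ plot computes, here is a minimal sketch that builds the two sets of quantiles by hand for a hypothetical normal sample. The plotting positions `(i - 0.5) / n` are one common choice and an assumption here; library functions such as `scipy.stats.probplot` may use a slightly different formula.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
sample = rng.normal(loc=180.0, scale=6.0, size=200)  # hypothetical data

# Sample quantiles: the sorted observations.
sample_q = np.sort(sample)

# Theoretical quantiles: standard normal values at positions (i - 0.5) / n.
n = len(sample_q)
positions = (np.arange(1, n + 1) - 0.5) / n
theory_q = np.array([NormalDist().inv_cdf(p) for p in positions])

# For normal data the points (theory_q, sample_q) fall close to a straight
# line whose slope and intercept estimate the standard deviation and mean.
slope, intercept = np.polyfit(theory_q, sample_q, 1)
r = np.corrcoef(theory_q, sample_q)[0, 1]

print(slope, intercept, r)
```

A correlation close to 1 between the two sets of quantiles corresponds to points hugging the reference line in the plot.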

### Comparison of Distributions Using QQ Plot

####  Generating a Normal Sample and Comparing Distributions
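We will generate a sample of size 10 from a normal distribution using the mean and standard deviation estimated from the 'height' sample. We then assess the distribution of player weights against the normal distribution with a QQ plot and analyze the results. Note that `stats.probplot` plots the weights against theoretical normal quantiles, so the generated sample is not itself used in the plot.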




In [None]:
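# Generate a sample of size 10 from a normal distribution with the mean and standard deviation of the height sample
# (mean_height_revised and std_dev_height_revised are defined in the seeded height-sampling cell; run that cell first)
normal_sample_size = 10
normal_sample = np.random.normal(mean_height_revised, std_dev_height_revised, normal_sample_size)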

# QQ plot to compare the distribution of player weights with the normal distribution
plt.figure(figsize=(8, 6))
stats.probplot(df['weight'], dist="norm", plot=plt)
plt.title("QQ Plot - Player Weights vs. Normal Distribution")
plt.ylabel('Player Weights')
plt.show()



#### Part 4 Results: QQ Plot Analysis
The QQ plot compares the distribution of player weights against a normal distribution. In the plot:
- If the data points (representing player weights) closely follow the straight reference line drawn by `probplot`, it indicates that the player weights are approximately normally distributed.
- Deviations from the line suggest deviations from normality.

In our plot, while there's a general alignment with the line in the middle, deviations are observed, especially at the tails. This suggests that while player weights may be approximately normally distributed, they deviate from normality at the extreme values.

This analysis helps in understanding how closely the player weights follow a normal distribution, with the deviations indicating potential skewness or outliers in the weight data.

In [None]:
# Set the seed for reproducibility
def set_seed(seed):
    np.random.seed(seed)
    random.seed(seed)

set_seed(810109203)

# Randomly sample 100 values from the 'height' column without replacement
sample_height_revised = df['height'].sample(100, replace=False)

# Calculate mean, variance, and standard deviation of the revised sample
mean_height_revised = sample_height_revised.mean()
variance_height_revised = sample_height_revised.var()
std_dev_height_revised = sample_height_revised.std()

mean_height_revised, variance_height_revised, std_dev_height_revised



###  Generate Poisson Samples and Create a Histogram

#### Step 1: Generate 500,000 Samples from the Poisson Distribution
We will generate 500,000 samples from a Poisson distribution with a lambda (λ) value of 3, as per the instructions. Each value is an independent draw from the distribution; sampling with or without replacement is a distinction that applies only when drawing from a finite dataset, so it does not arise here.

#### Step 2: Create a Histogram of These Samples
After generating the samples, we will create a histogram to visualize their distribution.

In [None]:
# Part A: Generate 500,000 samples from the Poisson distribution with lambda = 3
lambda_value = 3
n_samples_500k = 500000

poisson_samples_500k = np.random.poisson(lambda_value, n_samples_500k)

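# Create a histogram of these samples, with one bin centred on each integer value
plt.figure(figsize=(10, 6))
bins = np.arange(poisson_samples_500k.max() + 2) - 0.5
plt.hist(poisson_samples_500k, bins=bins, color='blue', edgecolor='black', alpha=0.7)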
plt.title('Histogram of 500,000 Poisson Samples (λ = 3)')
plt.xlabel('Sample Values')
plt.ylabel('Frequency')
plt.show()




#### Part A Results:
The histogram visualizes the distribution of 500,000 samples drawn from a Poisson distribution with λ (lambda) = 3 and shows the frequency of each integer value. For λ = 3 the distribution has mean and variance both equal to 3, its most probable values are 2 and 3, and it has a longer right tail than left tail.
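
One way to check the histogram against theory is to compare the observed relative frequencies with the Poisson probabilities P(X = k) = e^(−λ) λ^k / k!. The sketch below does this with its own freshly generated samples and an arbitrary seed, so its numbers will not match the cell above exactly.

```python
import math

import numpy as np

lam = 3
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
samples = rng.poisson(lam, 500_000)

# Empirical relative frequency of each integer value 0, 1, 2, ...
counts = np.bincount(samples)
empirical = counts / counts.sum()

# Theoretical Poisson probabilities P(X = k) = exp(-lam) * lam**k / k!
theoretical = np.array(
    [math.exp(-lam) * lam**k / math.factorial(k) for k in range(len(counts))]
)

# Print the first few values side by side.
for k in range(7):
    print(k, round(empirical[k], 4), round(theoretical[k], 4))
```

With this many samples the two columns should agree to about two or three decimal places.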

### Part B: QQ Plot Analysis

#### Step 3: Generate 550,000 Samples and Create a QQ Plot
Now, we will generate 550,000 samples from a Poisson distribution with λ = 3. Then, we will create a QQ plot to compare these samples' distribution with a normal distribution.




In [None]:
# Part B: Generate 550,000 samples from the Poisson distribution with lambda = 3
n_samples_550k = 550000

poisson_samples_550k = np.random.poisson(lambda_value, n_samples_550k)

# Create a QQ plot to compare these samples with a normal distribution
plt.figure(figsize=(10, 6))
stats.probplot(poisson_samples_550k, dist="norm", plot=plt)
plt.title('QQ Plot - 550,000 Poisson Samples vs. Normal Distribution')
plt.ylabel('Poisson Sample Quantiles')
plt.show()






#### Part B Results:
The QQ plot compares 550,000 Poisson samples (with λ = 3) against a normal distribution. The plot shows how closely the Poisson samples align with the expected quantiles of a normal distribution. Because the Poisson values are integers, the points form horizontal steps rather than a smooth curve. Deviations from the straight line in the QQ plot indicate deviations from normality.

### Part C: Shapiro-Wilk Test and Analysis Using CLT

#### Step 4: Shapiro-Wilk Test for Normality
We will now perform the Shapiro-Wilk test on both the 500,000 and 550,000 samples to evaluate the normality of these distributions. The test results will be interpreted in the context of the Central Limit Theorem (CLT).

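Since the number of samples is very large, the Shapiro-Wilk test is not well suited to the full datasets: SciPy's documentation notes that for N > 5000 the p-value may not be accurate. The code below therefore runs the test on the first 5,000 values of each set. Otherwise, we proceed as instructed.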

Let's start with the Shapiro-Wilk test for both sets of samples.

In [None]:
# Shapiro-Wilk test on the 500,000 Poisson samples
shapiro_statistic_500k, shapiro_p_value_500k = stats.shapiro(poisson_samples_500k[:5000])  # Limiting to first 5000 samples

# Shapiro-Wilk test on the 550,000 Poisson samples
shapiro_statistic_550k, shapiro_p_value_550k = stats.shapiro(poisson_samples_550k[:5000])  # Limiting to first 5000 samples

shapiro_statistic_500k, shapiro_p_value_500k, shapiro_statistic_550k, shapiro_p_value_550k





#### Part C Results: Shapiro-Wilk Test
- **For the 500,000-sample set (first 5,000 values tested):**
  - Shapiro-Wilk Statistic: 0.949
  - P-Value: Approximately 1.52e-38
- **For the 550,000-sample set (first 5,000 values tested):**
  - Shapiro-Wilk Statistic: 0.953
  - P-Value: Approximately 2.65e-37

In both cases, the p-values are far below the 0.05 significance level, so we reject the null hypothesis that the data come from a normal distribution. This indicates that the samples do not follow a normal distribution.

#### Analysis Using the Central Limit Theorem (CLT):
The Central Limit Theorem (CLT) states that the distribution of sample means will approximate a normal distribution as the sample size becomes large, regardless of the shape of the population distribution. In our case:
- The samples are individual observations from a Poisson distribution, not sample means.
- The large sample sizes (500,000 and 550,000) do not directly invoke CLT, as CLT applies to the distribution of sample means.
- The Shapiro-Wilk test results reflect the distribution of individual observations, which remain Poisson distributed and do not conform to normality.

Therefore, the results do not contradict the CLT. The normality tests indicate that the data are not normally distributed, which matches how they were generated: as individual Poisson draws. The CLT would come into play if we were looking at the means of many samples from this distribution, not the individual data points themselves.
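
The distinction between individual observations and sample means can be illustrated with a short simulation. This is a sketch under assumed settings (5,000 batches of 100 draws each, an arbitrary seed, and a hand-written `skewness` helper), not part of the original assignment. It uses skewness as a simple indicator: a normal distribution has skewness 0, while Poisson(3) has skewness 1/√3 ≈ 0.577.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
draws = rng.poisson(3, size=(5_000, 100))  # 5,000 batches of 100 draws each

raw_skew = skewness(draws.ravel())        # individual observations
mean_skew = skewness(draws.mean(axis=1))  # one mean per batch

print(raw_skew, mean_skew)
```

The individual draws keep the skewness of the Poisson distribution, while the batch means are much closer to symmetric, which is the behaviour the CLT describes.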