## Exploratory Data Analysis (EDA)
### Visual inspection
Exploring characteristics of your data is a critical step in any data science project. Using visualization libraries like Matplotlib and Seaborn can greatly assist in this process, making it easier to understand patterns, relationships, and structures within your data. Let's dive into some specifics.
#### Understanding distributions
One of the first steps in exploring your data could be understanding the distribution of various features. Histograms, box plots, and violin plots are commonly used for this purpose.
##### Histogram
A histogram groups values into bins and shows how many observations fall into each bin, which gives a quick picture of a feature's distribution. In seaborn, you can use sns.histplot() to create histograms.

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np

data = np.random.normal(size=100)
sns.histplot(data)
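
To see the numbers behind a histogram, the bin counts can also be computed directly. The following is a minimal sketch using np.histogram on a small made-up array; the values and the choice of three bins are assumptions for illustration only.

```python
import numpy as np

# Small made-up sample (illustrative only)
values = np.array([1, 2, 2, 3, 3, 3, 4])

# Split the range [1, 4] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print("Bin edges:", edges)
print("Counts per bin:", counts)
```

The last bin includes its right edge, so the value 4 is counted together with the 3s.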

##### Box plot
A box plot is used to depict groups of numerical data through their quartiles. It can give you a better understanding of the spread and skewness of your data. Outliers can also be spotted using box plots. Seaborn's sns.boxplot() can be used to create these.

In [None]:
data = pd.DataFrame(np.random.rand(50, 4), columns=['A', 'B', 'C', 'D'])
sns.boxplot(data=data)
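
Box plots usually flag outliers with the 1.5 × IQR rule, which corresponds to the default whis=1.5 setting of sns.boxplot(). A minimal sketch of that rule in NumPy, using a made-up array with one extreme value (the data are an assumption for illustration):

```python
import numpy as np

# Made-up sample with one obviously extreme value (illustrative only)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```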

##### Violin plot
A violin plot combines a box plot with a kernel density estimate, so it shows both summary statistics and the shape of the distribution. It can show the distribution of quantitative data across several levels of one (or more) categorical variables. Use sns.violinplot() to create violin plots.

In [None]:
tips = sns.load_dataset("tips")
sns.violinplot(x=tips["total_bill"])

#### Understanding relationships 
If your data has multiple features, it's often useful to understand how these features relate to each other. Scatter plots, line plots, and correlation heatmaps can be useful here.
##### Scatter plot
Scatter plots can help visualize the relationship between two numerical variables. In seaborn, you can use sns.scatterplot() to create scatter plots.

In [None]:
iris = sns.load_dataset("iris")
sns.scatterplot(x='sepal_length', y='sepal_width', data=iris)

##### Line plot
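A line plot connects successive data points with line segments, usually with an ordered variable such as time on the x-axis. Line plots are used to track changes over periods of time, and they tend to show small changes more clearly than bar plots. In seaborn, you can use sns.lineplot() to create line plots.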

In [None]:
years = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]
revenue = [200, 250, 275, 300, 350, 400, 450, 475, 525, 575]
sns.lineplot(x=years, y=revenue)

##### Heatmap
A heatmap is a graphical representation of data that uses color-coding to represent the values in a matrix. In EDA, heatmaps are commonly used to display a correlation matrix, so that strongly related pairs of features stand out at a glance.

In [None]:
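# correlation matrix of the numeric columns only
# (iris also has a non-numeric 'species' column, which recent pandas versions refuse to correlate)
corr = iris.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True)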

### Formal techniques
EDA is a critical step in the data science pipeline. It involves examining the data to understand its main characteristics, often with visual methods. Here, I will outline some key statistical techniques, both parametric and non-parametric, used during EDA.
#### Parametric methods
Parametric methods assume that the data follow a specific distribution, typically a Gaussian (normal) distribution. When that assumption holds, the parameters of the distribution (for the normal distribution, the mean and standard deviation) are enough to summarize the data.
##### Mean
It provides the central tendency of the dataset. Mean is the sum of all values divided by the number of values.

In [None]:
import numpy as np
data = np.array([1, 2, 3, 4, 5])
mean = np.mean(data)
print('Mean:', mean)

##### Standard Deviation
It quantifies the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out.

In [None]:
std_dev = np.std(data)
print('Standard Deviation:', std_dev)
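
Note that np.std() divides by n by default (the population formula). When the data are a sample from a larger population, the sample standard deviation, which divides by n - 1, is usually the one you want. A short sketch of the difference, using a separate illustrative array named values:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

population_std = np.std(values)        # ddof=0 (default): divide by n
sample_std = np.std(values, ddof=1)    # ddof=1: divide by n - 1

print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

pandas uses ddof=1 by default in Series.std() and DataFrame.std(), so the two libraries can give slightly different answers on the same data.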

##### Correlation
It measures the degree to which two variables are linearly related. If we have more than two variables, we typically use a correlation matrix.

In [None]:
import pandas as pd
# Use a separate name so the 1-D `data` array above stays available to later cells
df = pd.DataFrame({'A': np.random.rand(50), 'B': np.random.rand(50)})
correlation = df.corr()
print('Correlation:\n', correlation)
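
To make the definition concrete, the Pearson correlation coefficient can be computed by hand as the sum of products of the centered values, divided by the square root of the product of their sums of squares. A minimal sketch on made-up, perfectly linear data (the arrays x and y are assumptions for illustration), checked against np.corrcoef():

```python
import numpy as np

# Made-up data: y is an exact linear function of x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1

# Center both variables around their means
x_centered = x - x.mean()
y_centered = y - y.mean()

# Pearson's r from its definition
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# The same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print("Manual r:", r_manual)
print("NumPy r:", r_numpy)
```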

##### T-tests
These are used to determine if there is a significant difference between the means of two groups. In Python, you can use the scipy.stats.ttest_ind() function to conduct a t-test.

In [None]:
import numpy as np
from scipy import stats

# Create three sets of data
np.random.seed(0)  # for reproducibility
group1 = np.random.normal(50, 10, size=50)
group2 = np.random.normal(60, 10, size=50)
group3 = np.random.normal(55, 10, size=50)

# Perform a two-sample t-test on group1 and group2
t_stat, p_val = stats.ttest_ind(group1, group2)

print("t-statistic: ", t_stat)
print("p-value: ", p_val)
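
For intuition, the two-sample t-statistic can be reproduced by hand. The sketch below uses the pooled-variance form, which is what stats.ttest_ind() computes with its default equal_var=True; passing equal_var=False switches to Welch's t-test, which does not assume equal variances. The arrays a and b are made up for illustration.

```python
import numpy as np

# Made-up samples (illustrative only)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n_a, n_b = len(a), len(b)

# Sample variances (ddof=1) and their pooled estimate
var_a = a.var(ddof=1)
var_b = b.var(ddof=1)
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

# Standard error of the difference in means
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# t-statistic: difference in means measured in standard errors
t_manual = (a.mean() - b.mean()) / std_err

print("Manual t-statistic:", t_manual)
```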

##### Analysis of Variance (ANOVA)
This is used to analyze the difference among group means in a sample. In Python, you can use the scipy.stats.f_oneway() function for ANOVA.

In [None]:
# Perform one-way ANOVA
F_stat, p_val = stats.f_oneway(group1, group2, group3)

print("F-statistic: ", F_stat)
print("p-value: ", p_val)
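
The F-statistic is the ratio of the variability between group means to the variability within groups. A minimal sketch of that calculation on three small made-up groups (the data are an assumption for illustration):

```python
import numpy as np

# Three small made-up groups (illustrative only)
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the values around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)

print("Manual F-statistic:", f_manual)
```

A large F means the group means differ by more than the within-group noise would suggest.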

#### Non-parametric methods
Non-parametric methods come in handy when the data does not fit a normal distribution. These methods are based on ranks and medians.
##### Median
It is the value separating the higher half from the lower half of a data sample. It provides the central tendency of the dataset.

In [None]:
median = np.median(data)
print('Median:', median)
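
One reason to prefer the median for skewed data is that it is far less sensitive to extreme values than the mean. A short sketch with a made-up array containing one outlier:

```python
import numpy as np

# Made-up sample with one extreme value (illustrative only)
values = np.array([1, 2, 3, 4, 100])

mean_value = np.mean(values)
median_value = np.median(values)

print("Mean:", mean_value)      # pulled strongly towards the outlier
print("Median:", median_value)  # still the middle value
```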

##### Interquartile Range (IQR)
This is the range between the first quartile (25th percentile) and the third quartile (75th percentile). It is a measure of statistical dispersion.

In [None]:
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1
print('IQR:', IQR)

##### Spearman’s Rank Correlation 
It assesses how well the relationship between two variables can be described using a monotonic function.

In [None]:
from scipy.stats import spearmanr
data = pd.DataFrame({'A': np.random.rand(50), 'B': np.random.rand(50)})
correlation, _ = spearmanr(data['A'], data['B'])
print("Spearman's correlation: %.3f" % correlation)
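
Spearman's coefficient is the Pearson correlation of the ranks of the data, which is why it captures any monotonic relationship and not only linear ones. A minimal sketch on made-up data without ties; the helper ranks() is hypothetical and only valid when no values are tied (scipy.stats.rankdata handles ties properly):

```python
import numpy as np

# Made-up data: y increases monotonically with x, but not linearly
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

def ranks(a):
    # Hypothetical helper: argsort of argsort gives 0-based ranks when there are no ties
    return np.argsort(np.argsort(a))

# Pearson on the raw values versus Pearson on the ranks (Spearman)
pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print("Pearson: %.3f" % pearson)
print("Spearman: %.3f" % spearman)
```

Here Spearman's coefficient is 1 because the ordering is perfectly preserved, while Pearson's coefficient is below 1 because the relationship is curved.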

##### Mann-Whitney U Test
It is used to compare differences between two independent data samples. In Python, you can use the scipy.stats.mannwhitneyu() function for this test.

In [None]:
# Perform Mann-Whitney U Test on group1 and group2
u_stat, p_val = stats.mannwhitneyu(group1, group2)

print("U-statistic: ", u_stat)
print("p-value: ", p_val)
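
The U-statistic has a simple interpretation: it counts, over all pairs formed by one value from each sample, how often the value from one sample exceeds the value from the other (ties count as one half). A minimal sketch of that counting on two small made-up samples; which of the two U values a library reports can depend on its version, so both are computed here.

```python
import numpy as np

# Two small made-up samples (illustrative only)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Compare every value in x with every value in y
diff = x[:, None] - y[None, :]
ties = (diff == 0).sum()

u_x = (diff > 0).sum() + 0.5 * ties  # pairs where the x value is larger
u_y = (diff < 0).sum() + 0.5 * ties  # pairs where the y value is larger

print("U for x:", u_x)
print("U for y:", u_y)
```

The two values always add up to the number of pairs, len(x) * len(y). A U close to 0 or to that maximum indicates that one sample tends to have larger values than the other.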

##### Kruskal-Wallis H Test
This test is used when the assumptions of one-way ANOVA are not met. It's a rank-based nonparametric test that can be used to determine if there are statistically significant differences between two or more groups.

In [None]:
# Perform Kruskal-Wallis H Test
h_stat, p_val = stats.kruskal(group1, group2, group3)

print("H-statistic: ", h_stat)
print("p-value: ", p_val)

### Statistical inference
Statistical inference is the process of drawing conclusions about a population from a sample of it. It consists of selecting and modeling the data appropriately and interpreting the results correctly. There are two major types of statistical inference: estimation (point estimates and confidence intervals) and hypothesis testing.
#### Estimation
##### Point Estimate
It is a single value estimate of a parameter. For instance, the sample mean is a point estimate of the population mean.

In [None]:
import numpy as np

# Generating a sample
np.random.seed(0)
population = np.random.normal(loc=70, scale=10, size=1000000)
sample = np.random.choice(population, size=1000)

# Point estimate of the mean
point_estimate = np.mean(sample)
print('Point Estimate of Mean:', point_estimate)


##### Confidence Interval
A confidence interval is a range of values that is likely to contain the population parameter. For example, a 95% confidence level means that if we drew many samples and built an interval from each, about 95% of those intervals would contain the population mean. Any single interval either contains the true mean or it does not.

In [None]:
from scipy.stats import sem, t

# Confidence interval
confidence = 0.95
sample_stderr = sem(sample)  # Standard error of the mean
interval = sample_stderr * t.ppf((1 + confidence) / 2., len(sample) - 1)

print('Confidence interval for the mean:', (point_estimate - interval, point_estimate + interval))
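
The long-run interpretation of a confidence interval can be checked by simulation. The sketch below draws many samples from a normal population with a known mean, builds a 95% t-interval from each, and counts how often the interval contains the true mean. The population parameters, sample size, and number of trials are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)  # fixed seed for reproducibility

true_mean, n, trials = 70, 50, 2000

# Each row is one independent sample of size n
samples = rng.normal(loc=true_mean, scale=10, size=(trials, n))

# Sample mean and standard error of the mean for every sample
means = samples.mean(axis=1)
std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)

# Half-width of the 95% interval for each sample
margins = t.ppf(0.975, n - 1) * std_errs

# An interval covers the true mean if the sample mean is within one margin of it
covered = np.abs(means - true_mean) <= margins
coverage = covered.mean()

print("Fraction of intervals containing the true mean:", coverage)
```

The printed fraction should be close to 0.95, though not exactly equal to it, because the simulation itself is random.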

#### Hypothesis Testing
Hypothesis testing is a statistical method for making decisions from experimental data. We start from a null hypothesis, which is an assumption about a population parameter, and use the sample to assess whether the data are consistent with it.

Let's take an example, where we have a sample of weights and we are testing if the mean of the weights is significantly different from 70.

In [None]:
# Null Hypothesis: The mean weight is 70
# Alternative Hypothesis: The mean weight is not 70

from scipy.stats import ttest_1samp

t_statistic, p_value = ttest_1samp(sample, 70)

print('t-statistic:', t_statistic)
print('p-value:', p_value)

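if p_value < 0.05:  # significance level (alpha) of 0.05, i.e. 5%
    print("We reject the null hypothesis")
else:
    # A large p-value does not prove the null hypothesis; it only means the data do not contradict it
    print("We fail to reject the null hypothesis")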
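
The one-sample t-statistic measures how many standard errors the sample mean lies from the hypothesized mean. A minimal sketch of the calculation by hand, on a small made-up array of weights (the values and the name weights are assumptions for illustration):

```python
import numpy as np

# Small made-up sample of weights (illustrative only)
weights = np.array([68.0, 72.0, 71.0, 69.0, 75.0])
mu0 = 70  # hypothesized population mean under the null hypothesis

# Standard error of the mean, using the sample standard deviation (ddof=1)
std_err = weights.std(ddof=1) / np.sqrt(len(weights))

# t-statistic: distance between sample mean and mu0 in standard errors
t_manual = (weights.mean() - mu0) / std_err

print("Manual t-statistic:", t_manual)
```

ttest_1samp(weights, 70) should report the same statistic, together with a p-value based on len(weights) - 1 degrees of freedom.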