<!-- Data analysis is a crucial skill in today's world, where vast amounts of data are generated and collected across various domains. Python, with its rich ecosystem of libraries and tools, has become one of the most popular programming languages for data analysis. In this introduction, we will explore the basics of data analysis using Python.

Installing Python and Required Libraries:
To get started, install Python on your system; the latest version is available from the official website (python.org). The core data analysis libraries used here, NumPy, Pandas, and Matplotlib, are third-party packages that you can install with pip, Python's package manager, for example `pip install numpy pandas matplotlib`. -->

In [1]:
# Importing Libraries:

# Once you have installed the required libraries, you can import them into your Python script or Jupyter Notebook. Here's an 
# example of how to import NumPy, Pandas, and Matplotlib:
    
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
# Loading Data:

# Data analysis typically begins with loading data into Python. Pandas provides efficient data structures, such as DataFrames, 
# for handling and analyzing structured data. You can read data from various sources like CSV files, Excel files, databases, 
# etc. Here's an example of loading a CSV file using Pandas:

data = pd.read_csv(r'C:\Users\sisir.sahu\Documents\GitHub\Python for Data science\Data Analysis in Python\IPL Matches 2008-2020.csv')
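
As a self-contained sketch of the same loading step, the snippet below reads CSV text from memory instead of a file on disk. The rows and the name `sample` are made up for illustration; only the column names mirror the dataset above.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk; the rows are invented.
csv_text = """id,city,date,result_margin
1,Bangalore,2008-04-18,140
2,Chandigarh,2008-04-19,33
3,,2008-04-19,9
"""

# read_csv accepts a path or any file-like object. parse_dates converts the
# named column to a datetime dtype while loading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample.dtypes)        # 'date' is a datetime column, not plain text
print(sample.isna().sum())  # missing values per column (one empty city)
```

Parsing dates at load time avoids a separate `pd.to_datetime` call later, which is convenient when the date column is known in advance.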

In [8]:
# Exploring the Data:

# After loading the data, it's essential to understand its structure and contents. Pandas offers several tools for exploring 
# the data, such as the head(), info(), and describe() methods and the shape attribute. They provide a preview of the first 
# few rows, the column data types and non-null counts, summary statistics, and the data's dimensions.

print(data.head())       # Preview the first few rows

       id        city        date player_of_match  \
0  335982   Bangalore  2008-04-18     BB McCullum   
1  335983  Chandigarh  2008-04-19      MEK Hussey   
2  335984       Delhi  2008-04-19     MF Maharoof   
3  335985      Mumbai  2008-04-20      MV Boucher   
4  335986     Kolkata  2008-04-20       DJ Hussey   

                                        venue  neutral_venue  \
0                       M Chinnaswamy Stadium              0   
1  Punjab Cricket Association Stadium, Mohali              0   
2                            Feroz Shah Kotla              0   
3                            Wankhede Stadium              0   
4                                Eden Gardens              0   

                         team1                        team2  \
0  Royal Challengers Bangalore        Kolkata Knight Riders   
1              Kings XI Punjab          Chennai Super Kings   
2             Delhi Daredevils             Rajasthan Royals   
3               Mumbai Indians  Royal Challe

In [9]:
data.info()              # Column names, non-null counts, and dtypes (info() prints directly and returns None)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               816 non-null    int64  
 1   city             803 non-null    object 
 2   date             816 non-null    object 
 3   player_of_match  812 non-null    object 
 4   venue            816 non-null    object 
 5   neutral_venue    816 non-null    int64  
 6   team1            816 non-null    object 
 7   team2            816 non-null    object 
 8   toss_winner      816 non-null    object 
 9   toss_decision    816 non-null    object 
 10  winner           812 non-null    object 
 11  result           812 non-null    object 
 12  result_margin    799 non-null    float64
 13  eliminator       812 non-null    object 
 14  method           19 non-null     object 
 15  umpire1          816 non-null    object 
 16  umpire2          816 non-null    object 
dtypes: float64(1), i

In [10]:
print(data.describe())   # Summary statistics of the numeric columns

                 id  neutral_venue  result_margin
count  8.160000e+02     816.000000     799.000000
mean   7.563496e+05       0.094363      17.321652
std    3.058943e+05       0.292512      22.068427
min    3.359820e+05       0.000000       1.000000
25%    5.012278e+05       0.000000       6.000000
50%    7.292980e+05       0.000000       8.000000
75%    1.082626e+06       0.000000      19.500000
max    1.237181e+06       1.000000     146.000000


In [11]:
print(data.shape)        # Dimensions of the data (rows, columns)

(816, 17)


In [None]:
# Data Cleaning and Preprocessing:

# Data often requires cleaning and preprocessing before analysis. Pandas provides many tools for handling missing data, 
# removing duplicates, converting data types, and more. Common methods include dropna(), fillna(), drop_duplicates(), 
# and astype().
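
A minimal sketch of these four methods chained together is shown below. The DataFrame `raw` and its values are made up for illustration; the column names are borrowed from the dataset above, and the choice of which rows to drop or fill is an assumption that depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy data (invented): one duplicate row, one missing city, one missing margin.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "city": ["Delhi", "Mumbai", "Mumbai", None, "Kolkata"],
    "result_margin": [5.0, 12.0, 12.0, 7.0, np.nan],
})

clean = (
    raw.drop_duplicates()                   # remove the repeated id=2 row
       .dropna(subset=["result_margin"])    # drop rows that have no margin
       .fillna({"city": "Unknown"})         # fill the remaining gap in 'city'
       .astype({"result_margin": "int64"})  # safe because no NaN remains
       .reset_index(drop=True)
)
print(clean)
```

The order matters here: converting `result_margin` to an integer type would fail while the column still contains NaN, so the missing rows are dropped first.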

In [None]:
# Data Visualization:

# Visualizing data helps in understanding patterns, trends, and relationships. Matplotlib is a popular library for creating 
# various types of plots, such as line plots, bar plots, scatter plots, histograms, etc. Here's an example of creating a line
# plot using Matplotlib:
    
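# 'x' and 'y' are placeholder column names; replace them with numeric columns from your own DataFrame
plt.plot(data['x'], data['y'])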
plt.xlabel('x')
plt.ylabel('y')
plt.title('Line Plot')
plt.show()
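
For a version that runs without any external file, here is a small sketch using Matplotlib's object-oriented interface. The DataFrame `totals` and its numbers are invented purely for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (invented values, not taken from the IPL dataset).
totals = pd.DataFrame({
    "season": [2017, 2018, 2019, 2020],
    "matches": [4, 7, 5, 9],
})

# plt.subplots returns a Figure and an Axes; drawing on the Axes keeps the
# plot explicit when a notebook contains several figures.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(totals["season"], totals["matches"], marker="o")
ax.set_xlabel("season")
ax.set_ylabel("matches")
ax.set_title("Matches per season (toy data)")
ax.set_xticks(totals["season"])  # one tick per season instead of fractional years
fig.tight_layout()
plt.show()
```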

In [None]:
# Data Analysis and Manipulation:

# Pandas provides powerful tools for data analysis and manipulation. You can filter data, perform aggregations, sort and rank 
# data, apply mathematical operations, merge datasets, and more. These operations allow you to extract valuable insights from 
# the data efficiently.
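
The sketch below illustrates a few of these operations (a vectorised comparison, sorting, ranking, and a merge) on two small tables. The DataFrames `matches` and `teams`, along with all of their values, are hypothetical.

```python
import pandas as pd

# Toy tables (invented): match results and a lookup table of teams.
matches = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "winner": ["A", "B", "A", "C"],
    "result_margin": [10, 3, 25, 7],
})
teams = pd.DataFrame({
    "team": ["A", "B", "C"],
    "home_city": ["X", "Y", "Z"],
})

# Vectorised operation: derive a boolean column without an explicit loop.
matches["big_win"] = matches["result_margin"] >= 10

# Sort by margin (largest first) and rank so that 1 is the largest margin.
ranked = matches.sort_values("result_margin", ascending=False)
ranked["margin_rank"] = ranked["result_margin"].rank(ascending=False).astype(int)

# Merge: attach each winner's home city with a left join on the team name.
merged = ranked.merge(teams, left_on="winner", right_on="team", how="left")
print(merged[["id", "winner", "home_city", "result_margin", "margin_rank"]])
```

A left join keeps every match even if a team were missing from the lookup table; an inner join would silently drop such rows.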

In [None]:
# Statistical Analysis and Machine Learning:
    
# Python's ecosystem includes a wide range of statistical analysis and machine learning libraries, such as SciPy, 
# scikit-learn, and statsmodels. These libraries provide functions for hypothesis testing, regression analysis, clustering, 
# classification, and other advanced analytical tasks.
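
As one illustration of hypothesis testing, the sketch below runs Welch's two-sample t-test with SciPy. The two samples are synthetic, generated from normal distributions with different means; `group_a` and `group_b` are hypothetical names, and the choice of Welch's test is an assumption rather than a recommendation for any particular dataset.

```python
import numpy as np
from scipy import stats

# Synthetic samples (seeded for reproducibility): the true means differ by 5.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=20.0, scale=5.0, size=200)
group_b = rng.normal(loc=25.0, scale=5.0, size=200)

# equal_var=False selects Welch's t-test, which does not assume that the
# two groups share the same variance.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3g}")
```

A small p-value suggests that the difference in sample means is unlikely under the null hypothesis of equal population means.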

In [None]:
# Data Filtering and Selection:
    
# Pandas allows you to filter and select specific subsets of data based on conditions. Here's an example that filters 
# a DataFrame to include only rows where a specific column meets a certain criterion:

# 'column' is a placeholder; substitute a numeric column such as 'result_margin'
filtered_data = data[data['column'] > 10]
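
Beyond a single comparison, boolean masks can be combined and paired with column selection. The following sketch assumes a small made-up DataFrame `df` whose column names echo the dataset above; the values are invented.

```python
import pandas as pd

# Toy data (invented values).
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Kolkata"],
    "toss_decision": ["bat", "field", "field", "bat"],
    "result_margin": [5, 12, 30, 8],
})

# Combine conditions with & (and) or | (or); each condition needs its own
# parentheses because & and | bind more tightly than comparisons.
big_delhi = df[(df["city"] == "Delhi") & (df["result_margin"] > 10)]

# isin() tests membership in a list of allowed values.
metro = df[df["city"].isin(["Mumbai", "Kolkata"])]

# .loc takes a row mask and a column label in a single step.
bat_margins = df.loc[df["toss_decision"] == "bat", "result_margin"]

print(big_delhi, metro, bat_margins, sep="\n\n")
```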

In [None]:
# Data Aggregation:
    
# Aggregating data involves grouping rows by one or more columns and computing summary statistics for each group. Pandas 
# provides the groupby() method for this purpose. Here's an example that groups a DataFrame by a categorical variable and 
# calculates the mean of a numeric column:

# 'category' and 'numeric_column' are placeholders for a grouping column and a numeric column in your data
grouped_data = data.groupby('category')['numeric_column'].mean()
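
To compute several statistics per group at once, `agg()` accepts named aggregations. The sketch below uses a made-up DataFrame `df`; the output column names (`matches`, `mean_margin`, `max_margin`) are arbitrary labels chosen for this example.

```python
import pandas as pd

# Toy data (invented values).
df = pd.DataFrame({
    "toss_decision": ["bat", "field", "field", "bat", "field"],
    "result_margin": [5, 12, 30, 9, 6],
})

# Named aggregation: each keyword becomes an output column, and its value
# names the aggregation function applied within every group.
summary = df.groupby("toss_decision")["result_margin"].agg(
    matches="count",
    mean_margin="mean",
    max_margin="max",
)
print(summary)
```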

In [None]:
# Data Visualization with Seaborn:
    
# Seaborn is a Python library built on top of Matplotlib that provides high-level statistical data visualization capabilities. 
# Here's an example of creating a scatter plot with a regression line using Seaborn:

import seaborn as sns

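# 'x' and 'y' are placeholder column names; replace them with numeric columns from your own DataFrame
sns.scatterplot(x='x', y='y', data=data)
sns.regplot(x='x', y='y', data=data, scatter=False)  # scatter=False draws only the fitted line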
plt.xlabel('x')
plt.ylabel('y')
plt.title('Scatter Plot with Regression Line')
plt.show()

In [13]:
# Handling Time Series Data:
    
# Python has dedicated libraries like Pandas and NumPy for handling time series data. Here's an example of resampling time 
# series data to a lower frequency (e.g., converting daily data to monthly data) using Pandas:

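# Assuming 'date' is a column representing dates in the DataFrame
data['date'] = pd.to_datetime(data['date'])
# numeric_only=True restricts the sum to numeric columns; text columns such as 'city' cannot be summed meaningfully.
# Note: pandas 2.2 and later deprecate the 'M' alias in favor of 'ME' (month end).
data_resampled = data.resample('M', on='date').sum(numeric_only=True)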
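
The same idea can be tried on a tiny made-up series. This sketch deliberately uses the `'MS'` (month start) alias instead of a month-end alias, so each bucket is labelled by the first day of its month; `daily` and `runs` are hypothetical names.

```python
import pandas as pd

# Toy daily data (invented values) spanning a month boundary.
daily = pd.DataFrame({
    "date": pd.date_range("2020-01-30", periods=4, freq="D"),
    "runs": [10, 20, 30, 40],
})

# Downsample from daily to monthly: rows are bucketed by calendar month and
# summed. With 'MS', each bucket is labelled by the first day of the month.
monthly = daily.resample("MS", on="date")["runs"].sum()
print(monthly)
```

Selecting the `runs` column before aggregating keeps the result to the one quantity of interest, which is often clearer than summing every numeric column.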

In [None]:
# Machine Learning - Linear Regression:
    
# Scikit-learn is a powerful library for machine learning in Python. Here's an example of performing linear regression on a 
# dataset using scikit-learn:
    
from sklearn.linear_model import LinearRegression

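# 'feature1', 'feature2', and 'target' are placeholder column names for numeric columns in your data
X = data[['feature1', 'feature2']]  # Input features
y = data['target']                  # Target variable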

model = LinearRegression()
model.fit(X, y)

# Predicting on new data
new_data = pd.DataFrame({'feature1': [1, 2, 3], 'feature2': [4, 5, 6]})
predictions = model.predict(new_data)
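
A runnable variant on synthetic data is sketched below. It generates a target from a known linear relationship, holds out a test set, and checks how well the fitted model recovers the coefficients. The data, the true coefficients (3, -2, and an intercept of 5), and the 75/25 split are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (seeded): target = 3*feature1 - 2*feature2 + 5 + noise.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "feature1": rng.uniform(0, 10, size=200),
    "feature2": rng.uniform(0, 10, size=200),
})
y = 3.0 * X["feature1"] - 2.0 * X["feature2"] + 5.0 + rng.normal(0, 0.5, size=200)

# Hold out 25% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

print("coefficients:", model.coef_)      # should be close to [3, -2]
print("intercept:", model.intercept_)    # should be close to 5
print("R^2 on held-out data:", model.score(X_test, y_test))
```

Scoring on a held-out set gives a more honest estimate of predictive quality than scoring on the rows used for fitting.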