<a href="https://colab.research.google.com/github/tamilmech/tamilselvan/blob/main/Cheat_Sheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised Learning Algorithms:

## Linear Regression:

Used for regression tasks where the target variable is continuous.
Assumes a linear relationship between features and target variable.
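
As a minimal sketch of the idea above, the snippet below fits a line to synthetic data. The data (`y = 2x + 1` with no noise) and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2x + 1, no noise
X = np.arange(10, dtype=float).reshape(-1, 1)  # ten samples, one feature
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # estimated weight of the single feature
intercept = model.intercept_  # estimated bias term
prediction = model.predict(np.array([[12.0]]))[0]  # continuous output
print(slope, intercept, prediction)
```

Because the data is exactly linear, the fitted slope and intercept should recover the generating values up to floating-point error.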

## Logistic Regression:

Used for classification tasks, most commonly binary classification.
Estimates the probability that a given input belongs to a certain category.
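
The probability estimate can be inspected with `predict_proba`. This is a small sketch on toy data; the one-feature dataset and the names `clf`, `proba`, and `labels` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (assumed): small values belong to class 0, large values to class 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

new_points = np.array([[1.0], [9.0]])
proba = clf.predict_proba(new_points)  # columns: P(class 0), P(class 1)
labels = clf.predict(new_points)       # class with the higher probability
print(proba, labels)
```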

## Decision Trees:

Used for both classification and regression tasks.
Learns decision rules to split the data based on features.

## Random Forest:

Ensemble learning method combining multiple decision trees.
Typically reduces overfitting and improves accuracy compared to a single decision tree.

## Support Vector Machines (SVM):

Effective for both classification and regression tasks.
Finds the hyperplane that separates the classes with the maximum margin, optionally in a higher-dimensional space via kernel functions.

## Gradient Boosting Machines (GBM):

Builds multiple weak models sequentially, each correcting errors of its predecessor.
Often used with decision trees as base learners.


# Unsupervised Learning Algorithms:

## K-Means Clustering:

Divides data points into k clusters based on feature similarity.
Minimizes intra-cluster variance.
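
A minimal sketch of K-Means on two well-separated groups of points. The points, `n_clusters=2`, and `random_state=0` are assumptions for illustration; `inertia_` is the within-cluster sum of squared distances that the algorithm tries to minimize.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of 2D points (assumed data)
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)  # cluster index for each point
centers = kmeans.cluster_centers_    # one centroid per cluster
inertia = kmeans.inertia_            # within-cluster sum of squared distances
print(labels, centers, inertia)
```

Which group receives label 0 or 1 is arbitrary; only the grouping itself is meaningful.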

## Principal Component Analysis (PCA):

Reduces the dimensionality of data by projecting it onto the directions of greatest variance (the principal components).
Identifies patterns and relationships in high-dimensional data.
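
A short sketch of PCA on synthetic data where three features are almost perfectly correlated, so one component should capture nearly all of the variance. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=200)

# Three features (assumed): the 2nd and 3rd are nearly linear functions of the 1st
X = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    -t + 0.01 * rng.normal(size=200),
])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # shape (200, 2)
ratios = pca.explained_variance_ratio_  # share of variance per component
print(X_reduced.shape, ratios)
```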

## Anomaly Detection:

Identifies outliers or anomalies in data that deviate significantly from the norm.
Can be based on statistical methods or machine learning algorithms.
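
One simple statistical approach is the z-score: flag values that lie more than a chosen number of standard deviations from the mean. The data and the threshold of 2.0 below are assumptions for illustration, not a recommended default.

```python
import numpy as np

# Assumed sample: seven values near 10 and one extreme value
values = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 50.0])

# z-score: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()

threshold = 2.0  # assumed cut-off; tune it for your data
outliers = values[np.abs(z_scores) > threshold]
print(outliers)
```

Note that extreme values inflate the mean and standard deviation themselves, so robust alternatives (for example, median-based scores) may suit heavily contaminated data better.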

## Association Rule Learning (e.g., Apriori Algorithm):

Discovers interesting relationships between variables in large datasets.
Commonly used in market basket analysis.
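
The sketch below computes support and confidence, the two metrics that Apriori-style algorithms rely on, in plain Python. It is not an implementation of Apriori itself; the transactions and helper names (`support`, `confidence`) are hypothetical.

```python
# Assumed toy market-basket data: each transaction is a set of items
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

support_bread_milk = support({"bread", "milk"})       # 3 of 5 transactions
conf_bread_to_milk = confidence({"bread"}, {"milk"})  # 3 of the 4 bread transactions
print(support_bread_milk, conf_bread_to_milk)
```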

# Data Analysis Libraries

## Pandas:

Data manipulation and analysis library.
Provides data structures like DataFrame and Series.
Offers functionality for reading/writing data, data cleaning, filtering, grouping, and more.

## NumPy:

Fundamental package for scientific computing with Python.
Provides support for multidimensional arrays and matrices.
Offers mathematical functions to operate on these arrays.

## Matplotlib:

2D plotting library for creating static, interactive, and animated visualizations in Python.
Capable of creating various types of plots like line plots, scatter plots, histograms, etc.

## Seaborn:

Statistical data visualization library based on Matplotlib.
Provides a high-level interface for drawing attractive and informative statistical graphics.

## Scikit-learn:

Simple and efficient tools for data mining and data analysis.
Provides various machine learning algorithms for classification, regression, clustering, dimensionality reduction, etc.

# Data Analysis Techniques

## Data Cleaning:

Handling missing values (fillna, dropna).
Removing duplicates (drop_duplicates).
Data type conversion (astype).
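
The three cleaning steps above might be combined as in this sketch. The DataFrame contents and column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Assumed raw data: one duplicated row, ages stored as strings, missing salaries
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Charlie"],
    "age": ["25", "30", "30", "35"],
    "salary": [50000.0, np.nan, np.nan, 70000.0],
})

rows_without_missing = raw.dropna()  # keeps only fully populated rows

clean = raw.drop_duplicates()                              # drop the repeated "Bob" row
clean = clean.fillna({"salary": clean["salary"].mean()})   # fill gaps with the column mean
clean = clean.astype({"age": "int64"})                     # convert strings to integers
print(clean)
```

Each method returns a new DataFrame, which is why the result is reassigned at every step.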

## Exploratory Data Analysis (EDA):

Summary statistics (describe).
Data visualization (Matplotlib, Seaborn).
Correlation analysis (corr).

## Data Manipulation:

Indexing and selecting data (loc, iloc).
Filtering data (query, boolean indexing).
Grouping and aggregation (groupby, agg).
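
A brief sketch of filtering and grouping on a small DataFrame. The column names (`dept`, `age`, `salary`) and values are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["eng", "eng", "ops", "ops", "ops"],
    "age": [25, 35, 30, 45, 50],
    "salary": [60000, 80000, 50000, 55000, 65000],
})

over_30 = df[df["age"] > 30]      # boolean indexing
same_rows = df.query("age > 30")  # the same filter written as a query string
eng_salaries = df.loc[df["dept"] == "eng", "salary"]  # rows by condition, column by label

# Named aggregation: one output column per (input column, function) pair
summary = df.groupby("dept").agg(
    mean_salary=("salary", "mean"),
    headcount=("age", "count"),
)
print(summary)
```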





## Feature Engineering:

Creating new features.
Handling categorical variables (one-hot encoding, label encoding).
Scaling and normalization.
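
The three techniques above can be sketched as follows. The columns and the derived `bmi` feature are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# Creating a new feature from existing columns
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["color"])

# Standardization: rescale each column to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["height_cm", "weight_kg"]])
print(encoded.columns.tolist())
print(scaled)
```

In a real workflow the scaler would usually be fitted on the training split only and then applied to the test split, to avoid leaking test information.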

## Statistical Analysis:

Hypothesis testing (t-test, ANOVA).
Regression analysis (OLS, logistic regression).
Time series analysis (ARIMA, seasonal decomposition).

## Machine Learning:

Model selection and evaluation (train_test_split, cross_val_score).
Model training and prediction.
Hyperparameter tuning (GridSearchCV, RandomizedSearchCV).
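
As a rough sketch, hyperparameter tuning with `GridSearchCV` could look like this. The synthetic dataset, the choice of `KNeighborsClassifier`, and the candidate values for `n_neighbors` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data (assumed)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Try each candidate value with 5-fold cross-validation on the training set
param_grid = {"n_neighbors": [1, 3, 5, 7]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # accuracy of the refitted best model
print(best_k, test_accuracy)
```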

## Data Visualization:

Customizing plots (labels, titles, legends).
Plotting multiple subplots.
Interactive visualization (Plotly, Bokeh).

## Reporting and Presentation:

Generating reports (Jupyter Notebooks, Markdown).
Creating interactive dashboards (Dash, Streamlit).
Communicating findings effectively.

# NumPy

NumPy is a Python library for numerical computing, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently.

## Importing NumPy
`import numpy as np`

## Create an array from a list
`arr = np.array([1, 2, 3])`

## Create a 2D array (matrix)
`matrix = np.array([[1, 2, 3], [4, 5, 6]])`

## Create arrays with specific values
`zeros_arr = np.zeros((3, 3))  # 3x3 array of zeros`

`ones_arr = np.ones((2, 2))  # 2x2 array of ones`

`random_arr = np.random.rand(2, 2)  # 2x2 array of uniform random numbers in [0, 1)`

## Element-wise addition, subtraction, multiplication, and division
`arr1 + arr2`

`arr1 - arr2`

`arr1 * arr2`

`arr1 / arr2`

## Dot product of arrays
`np.dot(arr1, arr2)`
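
The arithmetic and dot-product snippets use `arr1` and `arr2` without defining them. Here is a small worked sketch with assumed values for both arrays.

```python
import numpy as np

# Assumed example arrays of equal length
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

total = arr1 + arr2     # element-wise sum: [5, 7, 9]
product = arr1 * arr2   # element-wise product: [4, 10, 18]
quotient = arr1 / arr2  # element-wise true division (float result)

dot = np.dot(arr1, arr2)  # 1*4 + 2*5 + 3*6 = 32

# For a 2D array and a 1D array, np.dot performs a matrix-vector product
matrix = np.array([[1, 2, 3], [4, 5, 6]])
mat_vec = np.dot(matrix, arr1)  # [14, 32]
print(total, product, quotient, dot, mat_vec)
```

Note that `*` is element-wise; it is not matrix multiplication.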



## Transpose of a matrix
`matrix.T`




## Element-wise mathematical functions

```
np.sin(arr)
np.cos(arr)
np.exp(arr)
np.sqrt(arr)
```





## Reshape array
`matrix.reshape((3, 2))  # The element count must stay the same: 2x3 becomes 3x2`






## Accessing elements


```
arr[0]  # Access first element
matrix[1, 2]  # Access element at row index 1, column index 2 (indices are zero-based)
```




## Slicing


```
arr[:2]  # First two elements
matrix[:, 1]  # Second column
matrix[1, :]  # Second row
```




## Sum, mean, min, max


```
np.sum(arr)
np.mean(matrix)
np.min(matrix)
np.max(matrix)
```





## Aggregation along axis


```
np.sum(matrix, axis=0)  # Sum down the rows: one value per column
np.mean(matrix, axis=1)  # Mean across the columns: one value per row
```
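
The `axis` argument is easy to mix up, so here is a small worked sketch using the same 2x3 `matrix` as earlier in this section (values assumed for illustration).

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

col_sums = np.sum(matrix, axis=0)    # collapses the rows: [1+4, 2+5, 3+6] = [5, 7, 9]
row_means = np.mean(matrix, axis=1)  # collapses the columns: [2.0, 5.0]
total = np.sum(matrix)               # no axis: sum of every element = 21
print(col_sums, row_means, total)
```

A useful rule of thumb: the axis you pass is the one that disappears from the result's shape.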




# Pandas

Pandas is a Python library for data manipulation and analysis, providing powerful data structures and tools for working with structured data.

## Importing Pandas
`import pandas as pd`

## Creating DataFrame


```
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)  # Create a DataFrame from a dictionary of columns
```

## Reading Data
```
df = pd.read_csv('data.csv')  # Read data from a CSV file
df = pd.read_excel('data.xlsx')  # Read data from an Excel file (needs an Excel engine such as openpyxl)
```

## Viewing Data
```
df.head()  # View the first 5 rows
df.tail()  # View the last 5 rows
df.sample(5)  # View 5 random rows
df.shape  # Dimensions as (rows, columns)
df.columns  # Column names
df.info()  # Concise summary: dtypes, non-null counts, memory usage
df.describe()  # Summary statistics for numeric columns
```

## Selecting Data
```
df['column_name']  # Select a single column (returns a Series)
df[['column1', 'column2']]  # Select multiple columns (returns a DataFrame)
df.loc[row_label]  # Select a row by index label
df.iloc[row_index]  # Select a row by integer position
df.loc[row_label, 'column_name']  # Select a single cell by label
df.iloc[row_index, column_index]  # Select a single cell by integer position
df.query('condition')  # Select rows that satisfy a condition, e.g. 'Age > 30'
```
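
The selection patterns above use placeholder names. Applied to the sample DataFrame from the "Creating DataFrame" section, they might look like this sketch (result variable names are assumptions).

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 35, 40],
        'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)  # default integer index: 0, 1, 2, 3

ages = df['Age']                  # single column as a Series
subset = df[['Name', 'Salary']]   # two columns as a DataFrame
second_row = df.iloc[1]           # row at integer position 1 (Bob)
bob_salary = df.loc[1, 'Salary']  # label 1 equals position 1 with the default index
first_cell = df.iloc[0, 0]        # first row, first column
older = df.query('Age > 30')      # rows for Charlie and David
print(older)
```

`loc` and `iloc` only coincide here because the index is the default 0..n-1 range; with a custom index, `loc` uses the labels and `iloc` the positions.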
## Data Cleaning
```
# Each call returns a new DataFrame; assign the result to keep the change
df.dropna()  # Remove rows with missing values
df.fillna(value)  # Fill missing values with the specified value
df.drop_duplicates()  # Remove duplicate rows
df.drop(columns=['column_name'])  # Remove a column
df.rename(columns={'old_name': 'new_name'})  # Rename a column
df.astype({'column_name': 'new_dtype'})  # Change the data type of a column
```
## Data Manipulation
```
df['new_column'] = df['column1'] + df['column2']  # Create a new column
df['new_column'] = df.apply(lambda row: function(row), axis=1)  # Apply a function to each row
df['new_column'] = df['column'].map(mapping_dict)  # Map values using a dictionary
df.groupby('column_name').agg({'column': 'function'})  # Group by a column and aggregate
```
## Data Visualization
```
df['column'].plot(kind='bar')  # Plot a bar chart
df.plot(x='x_column', y='y_column')  # Plot a line chart
df.plot.scatter(x='x_column', y='y_column')  # Plot a scatter plot
```
## Saving Data
```
df.to_csv('new_data.csv', index=False)  # Save the DataFrame to a CSV file
df.to_excel('new_data.xlsx', index=False)  # Save the DataFrame to an Excel file
```



# Matplotlib

## Importing Matplotlib
`import matplotlib.pyplot as plt`

## Sample data
```
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
```
## Line Plot
```
plt.plot(x, y)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Line Plot')
plt.show()
```

## Scatter Plot
```
plt.scatter(x, y)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Scatter Plot')
plt.show()
```
## Bar Plot
```
plt.bar(x, y)
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Bar Plot')
plt.show()
```
## Histogram
```
plt.hist(y, bins=5)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()
```
## Box Plot
```
plt.boxplot(y)
plt.ylabel('Value')
plt.title('Box Plot')
plt.show()
```
## Pie Chart
```
plt.pie(y, labels=x, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()
```


# Seaborn

Seaborn is a Python visualization library based on Matplotlib, providing high-level interfaces for creating informative statistical graphics.

## Importing Seaborn


```
import seaborn as sns
import pandas as pd
```



## Creating sample DataFrame
```
data = {'Category': ['A', 'B', 'A', 'B', 'A'],
        'Value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)
```
## Scatter Plot
```
sns.scatterplot(x='x_column', y='y_column', data=df)
```
## Line Plot
```
sns.lineplot(x='x_column', y='y_column', data=df)
```
## Bar Plot
```
sns.barplot(x='x_column', y='y_column', data=df)
```
## Histogram
```
sns.histplot(x='column', data=df)
```
## Box Plot
```
sns.boxplot(x='x_column', y='y_column', data=df)
```
## Heatmap
```
sns.heatmap(data=df.corr(numeric_only=True), annot=True)  # Correlate numeric columns only
```
## Pairplot
```
sns.pairplot(data=df)
```
## Violin Plot
```
sns.violinplot(x='x_column', y='y_column', data=df)
```

# Scikit-learn

`import pandas as pd`




```
# Assume X holds the feature columns and y the target column
X = df[['feature1', 'feature2', ...]]
y = df['target']
```
## Train/Test Split


```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```





## Model Initialization



```
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```



## Model Training

`model.fit(X_train, y_train)`

## Model Prediction


```
y_pred = model.predict(X_test)

```



## Model Evaluation



```
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
```
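
The steps in this section (split, initialize, train, predict, evaluate) can be run end to end as in the sketch below. The synthetic data stands in for `df`, and the coefficients 3 and -2 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (assumed): y = 3*x0 - 2*x1 + small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)     # learn coefficients from the training split
y_pred = model.predict(X_test)  # predict on data the model has not seen

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target, often easier to interpret
print(model.coef_, mse, rmse)
```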



## Cross-Validation



```
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X, y, cv=5)
```
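
A self-contained sketch of cross-validation on synthetic data follows. For regressors, `cross_val_score` reports R² per fold by default; the `scoring` argument selects another metric. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumed): y = 3*x0 - 2*x1 + small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression()

cv_scores = cross_val_score(model, X, y, cv=5)  # one R^2 score per fold
mean_score = cv_scores.mean()
std_score = cv_scores.std()

# scikit-learn scorers follow "higher is better", so MSE is reported negated
neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(cv_scores, mean_score, std_score, -neg_mse)
```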


# Scikit-learn Models

## Import necessary libraries

```
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
```
## Example usage

The examples below assume `X_train`, `X_test`, `y_train`, and `y_test` are your training and testing data. Classifiers expect a categorical target and regressors a continuous one.


## Linear Regression
```
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)
```
## Logistic Regression
```
logistic_reg = LogisticRegression()
logistic_reg.fit(X_train, y_train)
```

## Decision Tree Classifier
```
dt_classifier = DecisionTreeClassifier()
dt_classifier.fit(X_train, y_train)
```
## Decision Tree Regressor
```
dt_regressor = DecisionTreeRegressor()
dt_regressor.fit(X_train, y_train)
```

## Random Forest Classifier
```
rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)
```

## Random Forest Regressor
```
rf_regressor = RandomForestRegressor()
rf_regressor.fit(X_train, y_train)
```

## Support Vector Classifier (SVC)
```
svc = SVC()
svc.fit(X_train, y_train)
```

## Support Vector Regressor (SVR)
```
svr = SVR()
svr.fit(X_train, y_train)
```

## K-Nearest Neighbors Classifier
```
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train, y_train)
```
## K-Nearest Neighbors Regressor
```
knn_regressor = KNeighborsRegressor()
knn_regressor.fit(X_train, y_train)

```



# Cross-Validation Score

`from sklearn.model_selection import cross_val_score`

`from sklearn.linear_model import LinearRegression`