# Cheat Sheet

The data tables are provided as comma-separated values (CSV) text files. CSV files can be opened in a wide variety of applications, but they are best viewed in one that makes columnar data easy to manipulate.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("autodata.csv")
df.head()

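# load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
boston = load_boston()  # call the function; the bare name is not the dataset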

In [None]:
avg_stroke = df['stroke'].astype('float').mean(axis=0)
print(avg_stroke)

# assign the result back; inplace=True on a selected column may not update df in newer pandas
df['stroke'] = df['stroke'].replace(np.nan, avg_stroke)

In [None]:
df['num-of-doors'].value_counts()
df['num-of-doors'].value_counts().idxmax()
# drop rows with null values in a particular column
df.dropna(subset=['horsepower-binned'], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

In [None]:
# data standardization: convert mpg to L/100km
df['city-L/100km'] = 235/df["city-mpg"]

**Data Wrangling is the process of converting data from the initial format to a format that may be better for analysis.**

**Data Normalization**
Normalization is the process of transforming the values of several variables into a similar range. Typical normalizations include scaling a variable so that its average is 0, scaling it so that its variance is 1, or scaling it so that its values range from 0 to 1.
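
The 0-to-1 style of normalization can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names and the sample values are made up, and `simple_feature_scaling` mirrors the `df['length']/df['length'].max()` pattern used in this sheet.

```python
def simple_feature_scaling(values):
    """Divide every value by the maximum, so the largest value becomes 1."""
    largest = max(values)
    return [v / largest for v in values]


def min_max_scaling(values):
    """Map the smallest value to 0 and the largest value to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


lengths = [150.0, 175.0, 200.0]  # made-up sample values
scaled_by_max = simple_feature_scaling(lengths)
scaled_min_max = min_max_scaling(lengths)
print(scaled_by_max)
print(scaled_min_max)
```

Note that min-max scaling divides by zero when all values are equal, so a real pipeline would need to guard against constant columns.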

**Data Standardization**
Data is usually collected from different agencies in different formats. (Data standardization is also the name of a particular type of data normalization, in which we subtract the mean and divide by the standard deviation.)

**What is Standardization?**

Standardization is the process of transforming data into a common format that allows the researcher to make meaningful comparisons.

**Evaluating for Missing Data**
Missing values are converted to NaN (np.nan), the default missing-value marker in pandas. We use pandas methods to identify these missing values. There are two methods to detect missing data:

1. isnull()
2. notnull()

The output is a boolean value for each entry, indicating whether that entry is missing. For isnull(), "True" stands for a missing value and "False" for a value that is present; notnull() returns the opposite.

**Deal with missing data**

1. Drop data
   * Drop the whole row
   * Drop the whole column
2. Replace data
   * Replace it by mean
   * Replace it by frequency / mode
   * Replace it based on other functions
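
The detection and handling options above can be tried on a tiny frame. This is a sketch under assumptions: the DataFrame `toy` and its values are made up, and only the column names are borrowed from the cells in this sheet.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column
toy = pd.DataFrame({
    "stroke": [3.4, np.nan, 3.0],
    "num-of-doors": ["four", "four", np.nan],
})

missing_mask = toy.isnull()              # True where a value is missing
missing_per_column = missing_mask.sum()  # count of missing values per column

# Drop: remove every row that has a missing 'stroke'
dropped = toy.dropna(subset=["stroke"]).reset_index(drop=True)

# Replace: mean for a numeric column, most frequent value for a categorical one
filled = toy.copy()
filled["stroke"] = filled["stroke"].fillna(filled["stroke"].mean())
filled["num-of-doors"] = filled["num-of-doors"].fillna(
    filled["num-of-doors"].value_counts().idxmax()
)
print(missing_per_column)
print(filled)
```

Dropping is simplest but loses rows; replacing keeps the row count at the cost of inventing plausible values.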

In [None]:
df['length'] = df['length']/df['length'].max()
df[['length', 'width']].head()

In [None]:
sns.boxplot(x=df['Maths'])
df.skew(axis=0, numeric_only=True)  # skewness of each numeric column

1. Mean: The arithmetic mean, also called simply the mean or the average, is the sum of a set of numbers divided by the count of numbers in the set.

2. Median: In a list of numbers sorted in ascending or descending order, the median is the middle number. It may be more representative of the data set than the mean when the data contains outliers.

3. Mode: The mode is the value that appears most frequently in a data set.

4. Standard Deviation: The standard deviation measures the amount of variation or dispersion of a set of values. It is the square root of the variance.

5. Variance: The variance is the expected value of the squared deviation of a random variable from its mean.
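
These five measures can be checked by hand with the standard library `statistics` module. The `marks` list is a made-up sample, and the population versions (`pvariance`, `pstdev`) are used here as an assumption about which definition is wanted.

```python
import statistics

marks = [35, 40, 40, 55, 80]  # made-up sample

mean = statistics.mean(marks)            # 250 / 5
median = statistics.median(marks)        # middle value of the sorted list
mode = statistics.mode(marks)            # most frequent value
variance = statistics.pvariance(marks)   # mean of the squared deviations
std_dev = statistics.pstdev(marks)       # square root of the variance

print(mean, median, mode, variance, std_dev)
```

pandas' `Series.std()` and `Series.var()` default to the sample versions (`ddof=1`), so their results differ slightly from the population values computed here.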

In [None]:
df.nunique(axis=0) #cols
df.nunique(axis=1) #rows

In [None]:
if (df["Marks"]<35).any():
    print(df[df['Marks']<35])

**Assign 3**

Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability.

Measures of central tendency show where the center or middle of the data set is located, whereas measures of variation show the dispersion among data values.

1. Variance: A measure of how far a set of numbers is spread out. It describes how far the numbers lie from the mean (expected value). It is the square of the standard deviation.
2. Standard deviation (SD): indicates how much a set of values is spread around the average. It can be computed for any numeric data, but it is easiest to interpret when the data are approximately normally distributed. SD is the square root of the variance.
3. Interquartile range (IQR): also known as the 'midspread' or 'middle fifty', it is a measure of statistical dispersion equal to the difference between the third and first quartiles[3]. IQR = Q3 − Q1. Unlike the (total) range, the IQR ignores the lowest 25% and the highest 25% of the values, so it is not affected by outliers and better reflects the spread of typical values.
4. Range: the length of the smallest interval that contains all the data. It is calculated by subtracting the smallest observation (sample minimum) from the greatest (sample maximum) and gives an indication of statistical dispersion. It has the same units as the data. Because it depends on just two observations, it tends to be a weak measure of dispersion and is most useful for small data sets.
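
The difference between the range and the IQR shows up clearly on data with one outlier. This is a small sketch using the standard library; the `incomes` list is made up, and `method="inclusive"` is one of several quartile conventions, chosen here as an assumption.

```python
import statistics

incomes = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # made-up sample; 100 is an outlier

# The range uses only the two extreme observations
value_range = max(incomes) - min(incomes)

# Quartiles by linear interpolation between the observed points
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1

print("range:", value_range)
print("IQR:", iqr)
```

The single outlier stretches the range to 99, while the IQR stays at 4.0 because it only looks at the middle half of the data.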


In [None]:
print(df.iloc[:, [1,5]])
print(df.iloc[:, 1:5])

In [None]:
# interquartile range
from scipy.stats import iqr
iqr(df['ApplicantIncome'])

The skewness values can be interpreted in the following manner:
1. Highly skewed distribution: If the skewness value is less than −1 or greater than +1.
2. Moderately skewed distribution: If the skewness value is between −1 and −½ or between +½ and +1.
3. Approximately symmetric distribution: If the skewness value is between −½ and +½.
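
The thresholds above translate directly into a small helper. This is a sketch: `skewness` computes the plain population (Fisher-Pearson) coefficient, and both function names are made up for illustration.

```python
def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def describe_skew(skew):
    """Apply the rule-of-thumb thresholds of 0.5 and 1."""
    if abs(skew) > 1:
        return "highly skewed"
    if abs(skew) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"


right_tailed = [1, 1, 1, 1, 10]  # made-up sample with a long right tail
print(skewness(right_tailed), describe_skew(skewness(right_tailed)))
print(describe_skew(skewness([1, 2, 3, 4, 5])))
```

pandas' `skew()` applies a bias correction, so its numbers will not match this population formula exactly on small samples.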

In [None]:
df = df.rename(columns={0: 'Sepal length'})  # rename returns a new DataFrame

In [None]:
df.groupby('Species').mean()
df['Species'].value_counts()
df.groupby(['Species']).count()
df['Species'].unique()

In [None]:
df = pd.DataFrame(data=boston.data, columns=boston.feature_names)

**Assign 4**

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.9, random_state=0)

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)  # reuse the training-set mean and variance; do not refit on test data

from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train,y_train)
y_pred = lr.predict(x_test)
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Square Error is: ")
print(rmse)

# For a regressor, score() returns R^2 (coefficient of determination), not accuracy
print("R^2 on the training set is: ")
print(lr.score(x_train, y_train))

print("R^2 on the test set is: ")
print(lr.score(x_test, y_test))

With random_state=0, we get the same train and test sets across different executions.

Linear regression analysis is used to predict the value of a variable based on the value of another variable.

The Mean Squared Error (MSE) measures how close a regression line is to a set of data points. It is a risk function corresponding to the expected value of the squared error loss. It is calculated by taking the mean of the squared differences between the observed values and the values predicted by the function. The Root Mean Square Error (RMSE) is its square root, which has the same units as the target.
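
The calculation can be written out by hand to see what the library function does. A minimal sketch: the helper name `mean_squared_error_manual` and the numbers are made up.

```python
import math


def mean_squared_error_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


actual = [3.0, 5.0, 7.0]      # made-up targets
predicted = [2.0, 5.0, 9.0]   # made-up predictions

mse = mean_squared_error_manual(actual, predicted)  # (1 + 0 + 4) / 3
rmse = math.sqrt(mse)
print(mse, rmse)
```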

StandardScaler standardizes the features of a data set by scaling to unit variance and removing the mean (optionally), using column summary statistics computed on the samples in the training set.
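
The point about training-set statistics is easiest to see in a stripped-down version. `TinyStandardScaler` is a hypothetical single-feature stand-in written for this note, not scikit-learn's class; it assumes the population standard deviation is used for scaling.

```python
import statistics


class TinyStandardScaler:
    """Hypothetical one-column illustration of fit-then-transform scaling."""

    def fit(self, values):
        # Learn the statistics from the training data only
        self.mean_ = statistics.fmean(values)
        self.scale_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored statistics to any data, including the test set
        return [(v - self.mean_) / self.scale_ for v in values]


train = [2.0, 4.0, 6.0, 8.0]  # made-up training feature
test = [5.0, 10.0]            # made-up test feature

scaler = TinyStandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # no second fit on the test data
print(train_scaled)
print(test_scaled)
```

Fitting again on the test set would give the test data its own mean and variance, so the model would see inputs on a different scale from the one it was trained on.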

Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.
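
How a fitted model turns features into a probability can be sketched with the sigmoid function. The weights, bias, and feature values below are hypothetical numbers chosen for illustration, not the output of any fitted model.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


weights = [0.8, -0.5]  # hypothetical coefficients
bias = -0.2            # hypothetical intercept

p = predict_probability([2.0, 1.0], weights, bias)
label = int(p >= 0.5)  # threshold the probability to get a class
print(p, label)
```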

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)  # the model must be fitted before predicting

y_predict = classifier.predict(x_test)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score, mean_squared_error
cm = confusion_matrix(y_test,y_predict)
print(cm)

In [None]:
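from sklearn.metrics import f1_score  # used below

# Only the last expression in a cell is displayed, so print each metric
print(accuracy_score(y_test, y_predict))
print(recall_score(y_test, y_predict))
print(f1_score(y_test, y_predict))
print(precision_score(y_test, y_predict))
print(mean_squared_error(y_test, y_predict))
print(classification_report(y_test, y_predict))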

#### Bayes' Theorem:

Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.

The Naïve Bayes classifier, which is built on this theorem, is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.

 **P(A|B) = [P(B|A)P(A)] / P(B)**

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence B given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.
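
A worked example makes the four terms concrete. The numbers are made up: suppose 20% of messages are spam (prior), a given word appears in 60% of spam messages (likelihood) and in 10% of non-spam messages. The helper name `posterior` is also just for illustration.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) from Bayes' theorem, with P(B) from the law of total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence


# Made-up numbers: A = "message is spam", B = "message contains the word"
p_spam_given_word = posterior(prior_a=0.2, p_b_given_a=0.6, p_b_given_not_a=0.1)
print(p_spam_given_word)
```

Seeing the word raises the probability of spam from the prior of 0.2 to about 0.6.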

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test,y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
print(cm)
disp.plot()
plt.show()

In [None]:
def get_confusion_matrix_values(y_true, y_pred):
    # For binary labels 0/1, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
    cm = confusion_matrix(y_true, y_pred)
    return cm[0][0], cm[0][1], cm[1][0], cm[1][1]

TN, FP, FN, TP = get_confusion_matrix_values(y_test, y_pred)
print("TP: ", TP)
print("FP: ", FP)
print("TN: ", TN)
print("FN: ", FN)

print("The Accuracy is: ", (TP+TN)/(TP+TN+FP+FN))
print("The precision is: ", TP/(TP+FP))
print("The recall is: ", TP/(TP+FN))

Confusion matrices show two types of errors: Type I (false positives) and Type II (false negatives).

1. Accuracy (all correct / all) = (TP + TN) / (TP + TN + FP + FN)
2. Misclassification (all incorrect / all) = (FP + FN) / (TP + TN + FP + FN)
3. Precision (true positives / predicted positives) = TP / (TP + FP)
4. Sensitivity aka Recall (true positives / all actual positives) = TP / (TP + FN)
5. Specificity (true negatives / all actual negatives) = TN / (TN + FP)
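
The five formulas can be bundled into one helper and checked on made-up counts. The function name `classification_metrics` and the counts are assumptions for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five ratios from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up counts for 100 predictions
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Accuracy and misclassification always sum to 1, which is a quick sanity check on the counts.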

**Assign 7**

In NLTK, Punkt is an unsupervised trainable model for sentence tokenization, which means it can be trained on unlabeled data (data that has not been tagged with information identifying its characteristics, properties, or categories).

**Tokenization**

1. Word tokenize: We use the word_tokenize() method to split a sentence into tokens or words.
2. Sentence tokenize: We use the sent_tokenize() method to split a document or paragraph into sentences.
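
To see roughly what the two kinds of tokenization produce without downloading any NLTK data, here is a naive regex stand-in. It is not the Punkt algorithm and not NLTK's tokenizer; the function names are made up, and it will mis-split abbreviations such as "Dr." that a trained model is meant to handle.

```python
import re


def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def naive_word_tokenize(sentence):
    """Return runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"[\w']+|[^\w\s]", sentence)


text = "I like mathematics. It is the easiest subject!"
sentences = naive_sent_tokenize(text)
words = naive_word_tokenize(sentences[0])
print(sentences)
print(words)
```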

In [None]:
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize, sent_tokenize
w1 = word_tokenize("hello i am yuuurlrl akdkd")

print(w1)

**POS (Part of Speech) Tagging**

A part-of-speech (POS) tag explains how a word is used in a sentence. The same word can have different contexts and semantic meanings in different sentences. Basic natural language processing (NLP) models such as bag-of-words (BoW) fail to identify these relations between words. POS tagging therefore marks each word with its part-of-speech tag based on its context in the data. POS tags are also used to extract relationships between words.

In [None]:
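from nltk import pos_tag  # needs the tagger resource, e.g. nltk.download('averaged_perceptron_tagger')
text = 'alslkkjdk dkdk'
word = word_tokenize(text)
pos_tag(word)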

**Stopword removal**

Stopwords are common words that do not add much meaning to a sentence, so they can usually be ignored without losing its meaning. Examples in English include "the", "he", and "have".

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')  # one-time download of the stopword lists
stop1 = stopwords.words('english')

txt = 'i like mathematics. mathematics is the easiest subject in my life'
clean_text = []
word = word_tokenize(txt)
for w in word:
  if w not in stop1:
    clean_text.append(w)

print("Original text",word)
print("after stop word removal", clean_text)

**Stemming**
Stemming reduces a word to its stem by removing the last few characters, which often produces stems that are misspelled or are not real words.

In [None]:
from nltk import SnowballStemmer
sbs = SnowballStemmer('english')
text = "Nltk full form is natural language toolkit. Engineering needs a vision"
word = word_tokenize(text)
print("Original Word: Word after stemming")
for w in word:
  print(w, " : ", sbs.stem(w))

**Lemmatization**
Lemmatization considers the context of a word and converts it to its meaningful base form, which is called the lemma.
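
The contrast with stemming can be shown with a toy example. Everything here is hypothetical: the `LEMMAS` table has three hand-picked entries and `toy_stem` strips a few suffixes, whereas a real lemmatizer relies on a full dictionary and a real stemmer on a complete rule set.

```python
# Hypothetical lookup table keyed by (word, part of speech)
LEMMAS = {
    ("better", "adj"): "good",
    ("ran", "verb"): "run",
    ("mice", "noun"): "mouse",
}


def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return LEMMAS.get((word.lower(), pos), word.lower())


def toy_stem(word):
    """Crude suffix stripping, shown only for contrast."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w, pos in [("mice", "noun"), ("better", "adj"), ("studies", "noun")]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w, pos))
```

The stemmer turns "studies" into the non-word "studi" and leaves "mice" unchanged, while the lookup maps "mice" to "mouse" because it knows the word and its part of speech.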

# Part 2
**TF-IDF (Term Frequency-Inverse Document Frequency)** is a commonly used weighting technique for information retrieval and text mining.

TF-IDF is a statistical method used to evaluate how important a word is to one document in a collection of documents (a corpus). The importance of the word increases in proportion to the number of times it appears in the document, but decreases with the frequency of its appearance across the corpus.

* **Term frequency (TF)**: the number of times a given word appears in a document. This number is usually normalized by the total number of words in the document to prevent it from favoring long documents, because a term is likely to appear more often in a long document than in a short one, whether it is important or not.

> **TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).**

Term frequency (TF) therefore indicates how often a term (keyword) appears in a document, relative to the length of that document.
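
The TF formula, together with one common definition of IDF, can be computed by hand. This is a sketch under assumptions: IDF is taken as log(N / number of documents containing the term), which is only one of several conventions, and the three toy documents are made up.

```python
import math

docs = [
    "this is sentence one".split(),
    "this is sentence two".split(),
    "this is sentence three".split(),
]


def tf(term, doc):
    """Occurrences of the term divided by the number of terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Assumed definition: log(N / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tf("one", docs[0]), idf("one", docs), tf_idf("one", docs[0], docs))
print(tf_idf("this", docs[0], docs))  # appears in every document
```

A word that appears in every document gets an IDF of 0 and so carries no weight. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its values will differ from these.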

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

d0 = 'new york times'
d1 = 'new york post'
d2 = 'los angeles times'
series = [d0,d1,d2]

#Create an object of tf-idf
tf_idf = TfidfVectorizer()
#get tf-idf values
result = tf_idf.fit_transform(series)

from nltk.text import TextCollection
from nltk.tokenize import word_tokenize

sents = ['this is sentence one', 'this is sentence two', 'this is sentence three']

sents = [word_tokenize(sent) for sent in sents]

print(sents)

cps = TextCollection(sents)
print(cps)

tf=cps.tf('one', cps)
print(tf)

idf=cps.idf('one')
print(idf)

tf_idf=cps.tf_idf('one',cps)
print(tf_idf)

In [None]:
sns.countplot(x=data['survived'])
sns.boxplot(x=data['age'])
sns.histplot(x=data['fare'], data=data, bins=20, hue='sex')

In [None]:
sns.boxplot(x=data['sex'], y=data['age'], color='red')
myplt = {"male":"b", "female":"r"}
sns.boxplot(x=data['sex'], y=data['age'], palette=myplt)
sns.boxplot(x=data['sex'], y=data['age'], data=data, hue='survived')

Observations:
1. We created a box plot of the variables 'age' and 'sex' and used 'survived' as the hue.
2. We thus visualized three variables: age, sex, and survival. Two of these are categorical and one is numerical.
3. The median age of females who didn't survive is slightly lower than that of females who survived.
4. The median age of males who didn't survive is slightly greater than that of males who survived.

In [None]:
sns.histplot(x=data["SepalLengthCm"])

In [None]:
sns.heatmap(data.iloc[:,0:5])

In [None]:
myplt = {"Iris-setosa":"b", "Iris-versicolor":"r", "Iris-virginica":"y"}
sns.boxplot(x=data['Species'], y=data["SepalLengthCm"], data=data, palette=myplt)

In [None]:
sns.histplot(x=data['Species'], color="red")

In [None]:
sns.boxplot(x=data['SepalWidthCm'])

A histogram is a graphical representation of data points organized into user-specified ranges.
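
What a histogram counts can be sketched without any plotting. The helper `histogram_counts` and the `fares` values are made up; it assumes equal-width bins with the top edge included in the last bin.

```python
def histogram_counts(values, bins, low, high):
    """Count how many values fall into each of `bins` equal-width ranges."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` goes into the last bin
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts


fares = [5, 7, 12, 18, 25, 31, 39, 40]  # made-up sample
counts = histogram_counts(fares, bins=4, low=0, high=40)
print(counts)
```

Each count is the height of one bar; changing `bins` changes the ranges and therefore the shape of the plot.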

A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heatmaps are used in various forms of analytics; in web analytics they commonly show user behavior on specific webpages or webpage templates, while sns.heatmap colors each cell of a numeric table according to its value.