
# Null

Check null value counts: `df.isnull().sum()`

Check the fraction of nulls per column (multiply by 100 for a percentage): `df.isnull().sum() / len(df)`

```
df.describe()
df.info()
```
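As a quick illustration of the null checks above, here is a minimal sketch on a made-up DataFrame. The column names and values are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
df = pd.DataFrame(
    {
        "price": [100.0, np.nan, 250.0, 300.0],
        "rooms": [2.0, 3.0, np.nan, np.nan],
    }
)

null_counts = df.isnull().sum()           # missing entries per column
null_share = df.isnull().sum() / len(df)  # fraction of rows missing per column

print(null_counts)
print(null_share * 100)  # scaled to a percentage
```

`df.isnull().mean()` should give the same fractions in one step, since the mean of a Boolean column is the share of `True` values.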

# Selecting categorical columns

`df.select_dtypes('object')`

`df.select_dtypes('object').nunique()`

# Splitting strings

It can be useful to split strings into their constituent parts and create new columns to contain them. We will use the `.str.split` method.

```
df[["col1", "col2"]] = df["col1-col2"].str.split(",", expand=True)
```
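A small worked example may help here. It assumes a hypothetical `"lat-lon"` column holding comma-separated coordinates; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical column holding "latitude,longitude" as one string
df = pd.DataFrame({"lat-lon": ["-34.60,-58.38", "-34.62,-58.44"]})

# expand=True returns one column per piece; astype(float) converts the text to numbers
df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)

# The combined column is no longer needed
df = df.drop(columns="lat-lon")
print(df)
```

Without `expand=True`, `.str.split` returns a single column of lists, which cannot be assigned to two columns directly.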

# Recasting data

Depending on who formatted the dataset, the types of data assigned to each column might need to be changed. If, for example, a column containing only numbers had been mistaken for a column containing only strings, we'd need to change that through a process called **recasting**, using the `astype` method.

```
newdf = df.astype('str')
```

To recast only an individual column:

```
df['col'] = df['col'].astype(int)
```
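The following sketch shows recasting on a toy column where numbers were read in as text. The `rooms` column is an assumption made up for the example.

```python
import pandas as pd

# Hypothetical column where numbers were stored as strings
df = pd.DataFrame({"rooms": ["2", "3", "4"]})

# Summing strings would concatenate them, so recast to integers first
df["rooms"] = df["rooms"].astype(int)
total_rooms = df["rooms"].sum()
print(total_rooms)
```

If a column contains entries that cannot be converted, `astype` raises an error; `pd.to_numeric(df["rooms"], errors="coerce")` is one alternative that turns such entries into missing values instead.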

# Dropping columns

Use the `drop` method to remove columns, and `dropna` to remove rows that contain missing values.

```
df2 = df.drop("col_name", axis='columns')

df.dropna(inplace=True)
```

# Concatenating

## Concatenating DataFrames

```
concat_df = pd.concat([df1, df2])
```

# Replacing col values
```
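# replace returns a new Series, so assign the result back
df['col'] = df['col'].replace(old_val, new_val)

# several replacements at once with a dictionary
dict_ = {old: new}
df['col'] = df['col'].replace(dict_)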
```

# Subsetting with masks

One way to create subsets from a larger dataset is through **masking**. Masks are ways to filter out the data you are not interested in so that you can focus on the data you are.

```
mask = df[col] > 200
mask
```
Notice that `mask` is a Series of Boolean values. Where values are 200 or smaller, our statement evaluates to `False`; where they are bigger than 200, it evaluates to `True`.

```
df[mask]
```
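Masks can also be combined. The sketch below uses invented property data to show two conditions joined with `&` (element-wise "and"); the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical property listings
df = pd.DataFrame(
    {
        "property_type": ["apartment", "house", "apartment", "apartment"],
        "area_m2": [150, 320, 240, 410],
    }
)

mask_apt = df["property_type"] == "apartment"  # True for apartments
mask_area = df["area_m2"] > 200                # True for areas above 200

# Keep only rows where both conditions hold
subset = df[mask_apt & mask_area]
print(subset)
```

Use `|` for "or" and `~` to negate a mask. Python's plain `and`, `or`, and `not` do not work element-wise on a Series.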

# Histograms

A **histogram** is a graph that shows the frequency distribution of numerical data. In addition to helping us understand frequency, histograms are also useful for detecting outliers.

```
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(file_path, usecols=['col_name'])
plt.hist(df['col_name'], bins=10, rwidth=0.9, color='b')
plt.title(title)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.grid(axis='y', alpha = 0.75)
```

You might have noticed that there are ten bars. In a histogram, we call these bars **bins**. A bin is simply a way to group data to make it easier to see trends. You can use as many or as few as you like; just recognize that the fewer bins you use, the less detailed the output will become.



# Scatter plots

A **scatter plot** is a graph that uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables, and are especially useful if you're looking for correlations.

```
plt.scatter(df['col1'], df['col2'], color = 'r')
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
```

# Quantiles

```
low, high = df['col_name'].quantile([0.1, 0.9])
mask = df['col_name'].between(low, high)
df[mask]
```
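To see what the quantile mask does, here is a minimal sketch on made-up numbers with one obviously small and one obviously large value. The `area` column is an assumption for the example.

```python
import pandas as pd

# Hypothetical areas with an outlier at each end
df = pd.DataFrame({"area": [1, 20, 30, 40, 50, 60, 70, 80, 90, 1000]})

# 10th and 90th percentiles of the column
low, high = df["area"].quantile([0.1, 0.9])

# between() is inclusive on both ends by default
mask = df["area"].between(low, high)
trimmed = df[mask]
print(trimmed["area"].tolist())
```

The values `1` and `1000` fall outside the two cutoffs, so only the middle eight rows remain.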

# Contain strings

```
mask = df['col_name'].str.contains('strings')
df[mask]
```

# Line plots

```
df = pd.DataFrame({'x_coords': range(0, 9000, 1000)})
df['y_coords'] = y0 + k * df['x_coords']
df.plot(x='x_coords', y='y_coords', xlabel=xlabel, ylabel=ylabel, label=label)
```

# Linear Regression

```
from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X_train, y_train)
y_pred = model.predict(X_train)
intercept = model.intercept_
coefficient = model.coef_

y_line = intercept + coefficient * X_train.values

plt.plot(X_train.values, y_line, label='Linear Regression Model')
plt.scatter(X_train, y_train)
plt.xlabel(xlabel)
plt.ylabel(ylabel)
plt.title(title)
plt.legend()

```

## Model types

**Linear Regression** is a way to predict the value of a target variable by fitting a line that best describes the relationship between **X** and **y** for the values we already have. If you remember `y = mx + b`, `b` is the intercept, and `m` is the beta coefficient (the slope). The beta coefficient tells us what change we can expect to see in `y` for every one-unit increase in `X`.
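As a sanity check on the roles of slope and intercept, the sketch below fits a model to points that lie exactly on the line `y = 2x + 1`. The data is synthetic and chosen so the expected answer is known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # change in y per one-unit increase in X
intercept = model.intercept_  # value of y when X is 0
prediction = model.predict(np.array([[5.0]]))[0]

print(slope, intercept, prediction)
```

The fitted slope and intercept should come out very close to 2 and 1, and the prediction for `X = 5` close to 11.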

## Statistical concepts

### Cost functions

When we train a model, we are solving an optimization problem. We provide training data to an algorithm and tell it to find the model or model parameters that best fit the data. But how can the algorithm judge what the "best" fit is? What criteria should it use?

A **cost function** (sometimes also called a loss or error function) is a mathematical formula that provides the score by which the algorithm determines the best fit. Generally, the goal is to minimize the cost function and get the lowest score. For linear models, these functions measure distance, and the model tries to get the closest fit to the data. For tree-based models, they measure impurity, and the model tries to get the purest terminal nodes.

### Residuals

When we perform any type of regression analysis, we end up with a line of best fit. Because our data comes from the real world, it tends to be a little bit messy, so the data points usually don't fall exactly on this line. Most of the time, they are scattered around it, and a residual is the vertical distance between an individual data point and the regression line. Each data point has only one residual, which can be positive if the point is above the regression line, negative if it's below the regression line, or zero if the line passes directly through the point. Think of it like this: the model describes a theoretical line. That line doesn't really exist outside the model. The residuals, however, are true values; they represent the actual data that came from real observations.
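A tiny numeric sketch of residuals, using invented observations and invented values on the fitted line:

```python
import numpy as np

# Hypothetical observed values and the corresponding points on the fitted line
y_true = np.array([3.0, 5.0, 8.0, 9.0])
y_line = np.array([3.0, 5.5, 7.0, 9.5])

# Residual = observed value minus value on the line
residuals = y_true - y_line
print(residuals)
```

The first point sits on the line (residual 0), the third is above it (positive residual), and the second and fourth are below it (negative residuals).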

### Performance metrics

In statistics, an error is the difference between a measurement and reality. There may not be any difference at all, but there's usually something not quite right, and we need to account for that in our model. To do that, we need to figure out the **mean absolute error (MAE)**. Absolute error is the size of the error in a single measurement, ignoring its sign, and mean absolute error is the average of those absolute errors over several measurements.
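The calculation can be sketched by hand and checked against scikit-learn's `mean_absolute_error`. The true and predicted values below are made up for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and model predictions
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 270.0])

# Absolute errors are 10, 10, 10 and 20; their average is the MAE
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```

Because the sign is dropped before averaging, over-predictions and under-predictions cannot cancel each other out.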

## Data concepts

### Leakage

**Leakage** is the use of data in training your model that would not typically be available when making predictions. For example, suppose we want to predict property prices in USD but include property prices in Mexican Pesos in our model. If we assume a fixed or nearly constant exchange rate, then our model will have a low error on the training data, but this will not reflect its performance on real-world data.

### Imputation

Datasets are often incomplete or missing entries. If the dataset is large and the missing entries are few, then the missing entries aren't all that important. But sometimes, it might be useful to include data with missing entries by finding a way to **impute** the missing entries in a row or column of a DataFrame. For example, you might use extrapolation when the data points have a pattern, or you might approximate each missing value with the column mean. scikit-learn's `SimpleImputer` does the latter by default.

```
from sklearn.impute import SimpleImputer
columns = ['col1', 'col2']
df = pd.read_csv(filepath, usecols=columns)
imputer = SimpleImputer()
imputer.fit(df)
imputed = imputer.transform(df)
# convert into DataFrame
df = pd.DataFrame(imputed, columns=columns)
```

### Generalization

Notice that we tested the model with a dataset that is different from the one we used to train the model. Machine learning models are useful if they allow you to make predictions about data other than what you used to train your model. We call this concept **generalization**. By testing your model with different data than you used to train it, you're checking to see if your model can generalize. Most machine learning models do not generalize to all possible types of input data, so they should be used with care. On the other hand, machine learning models that don't generalize to make predictions for at least a restricted set of data aren't very useful.
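One common way to check generalization is to hold out part of the data before training. The sketch below assumes synthetic data and uses scikit-learn's `train_test_split`; the 80/20 split and the random seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: a linear trend plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 1))
y = 3.0 * X[:, 0] + 50.0 + rng.normal(0, 10, size=200)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

mae_train = mean_absolute_error(y_train, model.predict(X_train))
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print(mae_train, mae_test)
```

If the test error is close to the training error, the model is generalizing to data it has not seen. A test error far above the training error suggests overfitting.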

## Model concepts

### Hyperparameters

When we instantiate an estimator, we can pass keyword arguments that dictate its structure. These arguments are called **hyperparameters**. For example, when we defined our decision tree estimator, we chose how many layers the tree would have using the `max_depth` keyword. This is in contrast to **parameters**, which are the numbers that our model uses to make predictions based on features. Parameters are optimized during the training process based on the data and input features: they keep changing during training, and the best-performing values are kept.

Hyperparameter values are set before training begins and are not changed during the training process. Pretty much all models have hyperparameters. Even a simple linear regressor has a hyperparameter: `fit_intercept`. Here are some common examples of hyperparameters:

* The imputation strategy used for missing data
* The number of trees in a random forest model
* The number of jobs to run in parallel when fitting and predicting
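The distinction can be seen directly on a scikit-learn estimator: hyperparameters are available from `get_params()` before any training, while learned parameters such as `coef_` only exist after `fit`. The toy data below is an assumption for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hyperparameter: chosen when the estimator is created, before training
model = LinearRegression(fit_intercept=False)
hyperparams = model.get_params()
print(hyperparams)

# Toy data lying on y = 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model.fit(X, y)

# Parameters: learned from the data during fit
print(model.coef_, model.intercept_)
```

With `fit_intercept=False` the line is forced through the origin, so the intercept stays at 0 and only the slope is learned.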

# Plotting scatter plots

## Map
```
import plotly.express as px

# make sure the coordinates are numeric
df['lon'] = df['lon'].astype(float)
df['lat'] = df['lat'].astype(float)
fig = px.scatter_mapbox(
    df,
    lat='lat',
    lon='lon',
    width=600,
    height=600,
    hover_data=['price_aprox_usd'],  # show price on hover
)
fig.update_layout(mapbox_style='open-street-map')
fig.show()
```

## 3D scatter plot

```
# scatter_3d takes x, y and z (it has no lat/lon arguments)
fig = px.scatter_3d(
    df,
    x='lon',
    y='lat',
    z='price_aprox_usd',
    labels={"lon": "longitude", "lat": "latitude", "price_aprox_usd": "price"},
    width=600,
    height=500,
)
fig.update_traces(
    marker={"size": 4, "line": {"width": 2, "color": "DarkSlateGrey"}},
    selector={"mode": "markers"},
)
fig.show()
```


# Creating a pipeline in scikit-learn

```
!pip install category_encoders --quiet
```

```
import pandas as pd
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from category_encoders import OneHotEncoder

lin_reg = linear_model.LinearRegression()
pipe = Pipeline([
    ("ohe", OneHotEncoder(use_cat_names=True)),
    ("regressor", lin_reg),
])
# feature_names = pipe.named_steps['ohe'].get_feature_names()
pipe.named_steps
# Output: {'ohe': OneHotEncoder(use_cat_names=True), 'regressor': LinearRegression()}
```

# glob

```
import glob

glob.glob("./data/ada-[0-9].csv")
glob.glob("./data/ada*")
```

It's not difficult to find the files we want. The `glob.glob` function allows for pattern matching. Here are a few of the more common patterns:

* `*` match any number of characters
* `?` match a single character of any kind
* `[a-z]` match any lower case alphabetical character in the current locale
* `[A-Z]` match any upper case alphabetical character in the current locale
* `[!a-z]` match any character that is not a lower case alphabetical character in the current locale

So far, you have only searched for files in one specific directory, `data`. It's also possible to search for files in subdirectories. To get a listing of all notebook files starting from the directory above this one and all others below it, you can use:

```
glob.glob("../**/*.ipynb", recursive=True)
```
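To try the patterns without depending on any real files, the sketch below builds a throwaway directory with `tempfile` and runs a few globs against it. The file names are invented for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files, created empty just for pattern matching
    for name in ["ada-1.csv", "ada-2.csv", "ada-x.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    os.makedirs(os.path.join(tmp, "sub"))
    open(os.path.join(tmp, "sub", "ada-3.csv"), "w").close()

    # Escape the directory part in case it contains glob characters
    base = glob.escape(tmp)

    def names(pattern, **kwargs):
        """Return sorted base names of the paths matching the pattern."""
        paths = glob.glob(os.path.join(base, pattern), **kwargs)
        return sorted(os.path.basename(p) for p in paths)

    digits = names("ada-[0-9].csv")        # one digit after "ada-"
    not_digits = names("ada-[!0-9].csv")   # one non-digit after "ada-"
    any_char = names("ada-?.csv")          # any single character
    recursive = names(os.path.join("**", "*.csv"), recursive=True)

print(digits, not_digits, any_char, recursive)
```

Note that `**` with `recursive=True` matches zero or more directory levels, so the recursive search also returns the files in the top-level directory.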

# One-hot encoding

A property's district is **categorical data**, or data which can be divided into groups. For many machine learning algorithms, it's common to create a column in a DataFrame to indicate if the feature is present or absent, instead of using the category's name. For each observation, you put a 1 or a 0 to indicate if the property is located in each neighborhood or not.

```
from category_encoders import OneHotEncoder
ohe = OneHotEncoder(use_cat_names=True)
ohe.fit(df)
df_ohe = ohe.transform(df)
```
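To see what one-hot encoding produces, here is a minimal sketch using pandas' `get_dummies` as a stand-in for the `category_encoders` encoder, so the example only needs pandas. The neighborhood names are invented, and the resulting column names follow pandas' convention rather than that of `category_encoders`.

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"neighborhood": ["Palermo", "Recoleta", "Palermo"]})

# One new 0/1 column per category
df_ohe = pd.get_dummies(df, columns=["neighborhood"], dtype=int)
print(df_ohe)
```

Each row has a 1 in exactly one of the new columns, marking the neighborhood that row belongs to.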

# Ordinal encoding

For many ML algorithms, it's common to use one-hot encoding. This works well if there are only a few categories, but as the number of categories grows, the number of additional columns grows with it.

Having a large number of columns (and consequently a large number of features in your model) can lead to a number of issues often referred to as the **curse of dimensionality**. Two primary issues that can arise are computational complexity (operations performed on large datasets may take longer) and overfitting (the model may not generalize to new data). In these scenarios, ordinal encoding is a popular choice for encoding the categorical variable. Instead of creating new columns, ordinal encoding simply replaces the categories in a categorical variable with integers.

One potential risk of ordinal encoding is that some ML algorithms assume the integer values imply an ordering of the categories. This matters in logistic regression, where a relationship is defined between increases or decreases in the features and the target. Techniques like decision trees work well with ordinal encoding because they generate splits. Rather than assuming any ordering between the numeric values, the splits will occur between the numeric values and effectively separate them.

```
from category_encoders import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit(df)
X_train_oe = oe.transform(df)
```
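The idea of replacing categories with integers can be sketched with pandas' `factorize`, used here as a stand-in so the example only needs pandas. The specific integers it assigns may differ from those produced by `category_encoders`, and the neighborhood names are invented.

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame(
    {"neighborhood": ["Palermo", "Recoleta", "Palermo", "Belgrano"]}
)

# Each distinct category gets one integer, in order of first appearance
codes, uniques = pd.factorize(df["neighborhood"])
df["neighborhood_code"] = codes
print(df)
print(list(uniques))
```

The result is a single integer column regardless of how many categories there are, which is what keeps the feature count down compared with one-hot encoding.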

# Ridge regression

Sometimes the values for coefficients and the intercept - both positive and negative - are very large. When you see this in a linear model - especially a high-dimensional model - what's happening is that the model is overfitting to the training data and then can't generalize to the test data. Some people call this the curse of dimensionality.

The way to solve this problem is to use regularization, a group of techniques that prevent overfitting. In this case, we will change the predictor from `LinearRegression` to `Ridge`, which is a linear regressor with an added tool for keeping model coefficients from getting too big.
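The shrinking effect can be sketched on synthetic data with two nearly identical features, a setup where plain linear regression tends to produce unstable coefficients. The data, the seed, and `alpha=1.0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=50)

# Two features that are almost copies of each other
X = np.column_stack([x, x + rng.normal(scale=0.01, size=50)])
y = 3.0 * x + rng.normal(scale=0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha controls the penalty strength

# Overall size of the coefficient vector for each model
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print(ols.coef_, ols_norm)
print(ridge.coef_, ridge_norm)
```

Because `Ridge` adds a penalty on the squared size of the coefficients, its coefficient vector is never larger than the unpenalized one on the same data. A larger `alpha` shrinks the coefficients further.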

# Multicollinearity

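**Multicollinearity** occurs when two or more features are highly correlated with each other, which makes the individual coefficients of a linear model unstable and hard to interpret.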
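One simple way to look for highly correlated features is a correlation matrix. The sketch below uses invented columns in which covered surface closely tracks total surface.

```python
import pandas as pd

# Hypothetical features: the two surface columns move almost in lockstep
df = pd.DataFrame(
    {
        "surface_total": [50, 60, 80, 100, 120],
        "surface_covered": [45, 55, 72, 90, 110],
        "rooms": [3, 1, 2, 4, 2],
    }
)

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()
print(corr)

surface_corr = corr.loc["surface_total", "surface_covered"]
```

A correlation close to 1 or -1 between two features is a sign that one of them adds little new information, and dropping one of the pair is a common response.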


```
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    OneHotEncoder(use_cat_names=True),  # encode categorical data
    SimpleImputer(),  # impute missing values
    Ridge(),  # regularized linear regression
)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_train)
```

```
import pandas as pd


def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    # Subset data: Apartments in "Capital Federal", less than 400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Subset data: Remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area].copy()  # copy so the column assignments below act on an independent frame

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Get place name
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    df.drop(columns="place_with_parent_names", inplace=True)

   
    return df
```