<center>
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="400" alt="cognitiveclass.ai logo"  />
</center>

# **??** 


## **Lab 1. ???**

Estimated time needed: **15** minutes

## **The tasks**



## **Objectives**

In this project, we will learn how to analyze time series data using Python. We will use various libraries and methods to visualize the data, identify patterns and anomalies, and build a linear regression model. Time series analysis is a powerful tool for understanding and predicting trends in various fields, and Python provides a flexible and efficient platform for conducting this analysis.

## **Table of Contents**

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li>Loading the CSV File</li>
        <li>Setting Indexes and Data Types</li>
        <li>Sorting and Resampling the Data</li>
        <li>Visualizing the Data</li>
        <li>Dropping and Interpolating Data</li>
        <li>Checking for Missing Data</li>
        <li>Decomposing the Time Series</li>
        <li>Interpreting Autocorrelation</li>
        <li>Finding Correlations</li>
        <li>Finding the Best Correlation</li>
        <li>Building a Linear Regression Model</li>
    </ol>
</div>
<hr>


## **Dataset Structure**
<details><summary>Click here for the structure</summary>

* ### **NAME** -   
* #### `'COLUMN'`: meaning of column
* #### `'COLUMN'`: meaning of column
* #### `'COLUMN'`: meaning of column
* #### `'COLUMN'`: meaning of column
* #### `'COLUMN'`: meaning of column
* #### `'COLUMN'`: meaning of column

</details>

### Introduction
We start by installing the required packages and importing the libraries used in this lab.

In [ ]:
! conda install -c conda-forge ta -y
! conda install scikit-learn -y

In [ ]:
import pandas as pd
import numpy as np
import seaborn as sns 

import warnings
warnings.filterwarnings("ignore")
#set precision 
pd.set_option("display.precision", 2)
#set precision for float
pd.options.display.float_format = '{:.2f}'.format

### Downloading and Loading the CSV File

Next, we will download the dataset in CSV format and load it into a Pandas DataFrame. We will use the `read_csv()` function to load the data and the `head()` method to display the first few rows of the DataFrame.

In [ ]:
path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX0VZCEN/LuxuryLoanPortfolio.csv'

Load the CSV:


In [ ]:
df = pd.read_csv(path, low_memory=False)
df.head()
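
As a minimal, self-contained illustration of `read_csv()` and `head()`, the sketch below loads a tiny in-memory CSV instead of the real file. The rows and values are made up for illustration; only the column names `funded_amount` and `funded_date` mirror the ones used later in this lab.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the real CSV file
csv_text = """funded_amount,funded_date
1000.0,2020-01-03
2500.0,2020-01-01
4000.0,2020-01-02
"""

# read_csv accepts a URL, a file path, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows only
print(sample.dtypes)    # funded_date is not parsed as a datetime by default
```

Because the date column is read as text, it has to be converted explicitly before any time-based operation.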

### Setting Indexes and Data Types

To work with time series data, we need to set the index of the DataFrame to the time column using `df.set_index()`. We can also convert columns to the required type using `astype()`.

We create a new DataFrame that contains only the columns we need from the base DataFrame:

In [ ]:
# Select only the needed columns; set_index returns a new DataFrame,
# which avoids modifying a slice of df in place
df_n = df[["funded_amount", "funded_date"]].set_index("funded_date")
df_n.head()

In [ ]:
df_n["funded_amount"] = df_n["funded_amount"].astype(float)
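
The same two steps can be tried on a small hypothetical DataFrame. The values below are invented, and the amounts are deliberately stored as strings so that the effect of `astype(float)` is visible.

```python
import pandas as pd

# Hypothetical data: amounts stored as strings, dates as text
raw = pd.DataFrame({
    "funded_amount": ["1000", "2500", "4000"],
    "funded_date": ["2020-01-03", "2020-01-01", "2020-01-02"],
})

# set_index returns a new DataFrame indexed by the date column
ts = raw.set_index("funded_date")

# astype converts the column to float; pd.to_datetime parses the index
ts["funded_amount"] = ts["funded_amount"].astype(float)
ts.index = pd.to_datetime(ts.index)

print(ts.dtypes)
print(type(ts.index).__name__)
```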

### Sorting and Resampling the Data

We will sort the DataFrame by its index using `df.sort_index()` and resample the data to the required frequency using the `resample()` method. This will allow us to aggregate the data over a specific time period, such as daily, weekly, or monthly.

In [ ]:
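# Parse the index as dates first so the sort is chronological rather than alphabetical
df_n.index = pd.to_datetime(df_n.index)
df_n = df_n.sort_index()
df_n.head(50)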

Here is an example of resampling to a 1-day frequency:

In [ ]:
df_n.index = pd.to_datetime(df_n.index)
df_n = df_n.resample("1D").sum()

In [ ]:
df_n.head(10)
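
To see what daily resampling with `sum()` does, consider this small sketch on invented data: two loans on the same day are added together, and a day with no loans appears as 0 rather than as a missing value.

```python
import pandas as pd

# Hypothetical loans: two on 1 Jan, none on 2 Jan, one on 3 Jan
loans = pd.DataFrame(
    {"funded_amount": [100.0, 50.0, 200.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-03"]),
)

# One row per calendar day; amounts within the same day are summed
daily = loans.resample("1D").sum()
print(daily)
```

Keep this in mind when reading the plots: zeros in the resampled series may simply mean that no loan was funded on that day.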

### Visualizing the Data

We will use Matplotlib to create line plots of the time series data, calling `plot()` on the resampled DataFrame.

In [ ]:
import matplotlib.pyplot as plt
import matplotlib as mpl

df_n.plot()
plt.xlabel("Date")
plt.ylabel("Amount")
plt.show()

The plot shows an anomalous value, so we need to remove it.

### Dropping and Interpolating Data

Sometimes time series data may have missing values, which can affect our analysis. We will use the `interpolate()` method to fill in the missing values with estimated values based on the surrounding data points.

In [ ]:
# Find the index of the maximum value in the 'funded_amount' column
max_index = df_n['funded_amount'].idxmax()
# Replace the maximum value with NaN
df_n.loc[max_index, 'funded_amount'] = np.nan
# Interpolate the missing value using the linear method
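# Assign the result back instead of using inplace=True on a column selection,
# which may not modify df_n in recent pandas versions
df_n['funded_amount'] = df_n['funded_amount'].interpolate(method='linear')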

Let's create a new plot to see the result:

In [ ]:
df_n.plot()
plt.xlabel("Date")
plt.ylabel("Amount")
plt.show()
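
The effect of masking the maximum and interpolating linearly is easy to verify on a short hypothetical series with a single spike:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one obvious spike at position 2
s = pd.Series([10.0, 12.0, 500.0, 16.0, 18.0])

s_clean = s.copy()
s_clean.loc[s_clean.idxmax()] = np.nan           # mask the largest value
s_clean = s_clean.interpolate(method="linear")   # fill it from its neighbours

print(s_clean.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

Linear interpolation replaces the masked point with the value on the straight line between its two neighbours, here the midpoint of 12 and 16.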

### Checking for Missing Data

We will use the `df.isnull()` method to check for missing data in the DataFrame. We also use `value_counts()` to count how many rows are missing and how many are not.

In [ ]:
df_n.isnull().value_counts()

`isnull()` returns `True` where a value is missing (null) and `False` where data is present.
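
A short sketch on invented data shows the boolean mask that `isnull()` produces. Summing the mask with `sum()` is an alternative way to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
frame = pd.DataFrame({"funded_amount": [1.0, np.nan, 3.0, np.nan]})

mask = frame.isnull()
print(mask["funded_amount"].tolist())  # [False, True, False, True]

# True counts as 1, so summing the mask counts the missing values
missing_per_column = frame.isnull().sum()
print(missing_per_column)
```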

### Decomposing the Time Series

We will use the `seasonal_decompose` function from the Statsmodels library to decompose the time series into its trend, seasonal, and residual components. This will allow us to analyze the individual components and identify any patterns or anomalies.

In [ ]:
from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(df_n['funded_amount'], model='additive', period=8)

plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(df_n['funded_amount'], label='Original')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(decomposition.trend, label='Trend')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(decomposition.seasonal, label='Seasonality')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(decomposition.resid, label='Residuals')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
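
To build intuition for what an additive decomposition computes, here is a simplified sketch using only pandas and NumPy on a synthetic series. It is not the exact algorithm used by `seasonal_decompose`, which may differ in details such as how even periods are handled. The series, the period of 7, and the weekly pattern are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

period = 7
n = 8 * period
idx = pd.date_range("2020-01-01", periods=n, freq="D")

# Synthetic series: linear trend + repeating weekly pattern (pattern sums to zero)
pattern = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -2.0, 1.0])
y = pd.Series(0.5 * np.arange(n) + np.tile(pattern, n // period), index=idx)

# 1. Trend: centered moving average over one full period
trend = y.rolling(window=period, center=True).mean()

# 2. Seasonal: average detrended value at each position in the cycle
detrended = y - trend
position = np.arange(n) % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()  # center on zero
seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=idx)

# 3. Residual: whatever trend and seasonality do not explain
resid = y - trend - seasonal

print(seasonal.head(period).round(2).tolist())
print(resid.dropna().abs().max())
```

Because the synthetic series contains no noise, the residual is essentially zero wherever the trend is defined. The first and last few values are `NaN` because the centered window does not fit there, which is also why the notebook calls `dropna()` on the residuals.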

<div class="alert alert-block alert-success">
    <b>❓Question:</b><br>
        Use a period of 30 in the decomposition.
</div>

In [ ]:
decomposition_q = seasonal_decompose(df_n['funded_amount'], model='additive', period=30)

<details><summary>Click here to see the answer</summary>
    decomposition_q = seasonal_decompose(df_n['funded_amount'], model='additive', period=30)
</details>

<div class="alert alert-block alert-success">
    <b>❓Question:</b><br>
        Build the decomposition plot.
</div>

In [ ]:
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(df_n['funded_amount'], label='Original')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(decomposition_q.trend, label='Trend')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(decomposition_q.seasonal, label='Seasonality')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(decomposition_q.resid, label='Residuals')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

<details><summary>Click here to see the answer</summary>
plt.figure(figsize=(12, 8))<br>
plt.subplot(411) <br>
plt.plot(df_n['funded_amount'], label='Original') <br>
plt.legend(loc='upper left')<br>
plt.subplot(412)<br>
plt.plot(decomposition_q.trend, label='Trend')<br>
plt.legend(loc='upper left')<br>
plt.subplot(413)<br>
plt.plot(decomposition_q.seasonal, label='Seasonality')<br>
plt.legend(loc='upper left')<br>
plt.subplot(414)<br>
plt.plot(decomposition_q.resid, label='Residuals')<br>
plt.legend(loc='upper left')<br>
plt.tight_layout()<br>
plt.show()
</details>

In [ ]:
df2 = pd.DataFrame()
df2['residual'] = decomposition.resid
df2 = df2.dropna()
df2.head(5)

<div class="alert alert-block alert-success">
<b>❓Question:</b><br>
    Use the residual column from your decomposition to create a new DataFrame.
</div>

In [ ]:
df2_q = pd.DataFrame()
df2_q['residual'] = decomposition_q.resid
df2_q = df2_q.dropna()
df2_q.head(5)

<details><summary>Click here to see the answer</summary>
df2_q = pd.DataFrame()<br>
df2_q['residual'] = decomposition_q.resid<br>
df2_q = df2_q.dropna()<br>
df2_q.head(5)
</details>

### Interpreting Autocorrelation

We will use the `plot_acf` function from the Statsmodels library to visualize the autocorrelation in the time series data. This function plots the autocorrelation function (ACF) for a given time series and lag range. We will use the `lags` parameter to specify the maximum number of lags to include in the plot.

In [ ]:
import statsmodels.api as sm
sm.graphics.tsa.plot_acf(df2.values.squeeze(), lags=50)
plt.show()
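
The idea behind the ACF plot can be checked numerically with the pandas method `Series.autocorr()`, which returns the Pearson correlation between a series and a lagged copy of itself. The sketch below uses a made-up series that repeats every 4 steps; the numbers may differ slightly from those drawn by `plot_acf`, which uses its own estimator.

```python
import numpy as np
import pandas as pd

# Hypothetical series that repeats every 4 steps
values = pd.Series(np.tile([1.0, 3.0, 2.0, -1.0], 10))

# Autocorrelation at lags 1..5; the peak is expected at the period (lag 4)
for lag in range(1, 6):
    print(lag, round(values.autocorr(lag=lag), 3))
```

A strong peak at a given lag suggests a repeating pattern with that period, which is what we look for in the plot above.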


<div class="alert alert-block alert-success">
<b>❓Question:</b><br>
    Build an autocorrelation plot (<code>plot_acf</code>) using your DataFrame, showing only 10 lags.
</div>

In [ ]:
sm.graphics.tsa.plot_acf(df2_q.values.squeeze(), lags=10)
plt.show()

<details><summary>Click here to see the answer</summary>
    sm.graphics.tsa.plot_acf(df2_q.values.squeeze(), lags=10)<br>
    plt.show()
</details>

### Finding Correlations

We will use the `corr` method to calculate the correlation coefficients between the time series data and its shifted versions. This will allow us to identify any lagged relationships between the variables.

In [ ]:
for shift in range(1, 20):
    df2[f'shift_{shift}'] = df2['residual'].shift(shift)
correlations = df2.corr()    
correlations
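
The following sketch shows what `shift()` does on a short invented series: each shifted column is the original moved down by the given number of rows, with `NaN` filling the gap at the top. `corr()` ignores those `NaN` rows pair by pair.

```python
import pandas as pd

# Hypothetical, perfectly linear series
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="residual")

lagged = pd.DataFrame({
    "residual": s,
    "shift_1": s.shift(1),  # value from 1 step earlier
    "shift_2": s.shift(2),  # value from 2 steps earlier
})
print(lagged)

# Pairwise correlations; rows containing NaN are skipped for each pair
lag_corr = lagged.corr()
print(lag_corr)
```

For this linear series every shifted copy is perfectly correlated with the original; real residuals will show much weaker values.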

### Finding the Best Correlation

We will use the `abs` method to calculate the absolute correlation coefficients and find the maximum value. We will also preserve the sign of the correlation coefficients to identify any negative relationships.

In [ ]:
# Correlation between the 'residual' column and each of its shifted versions
signed_correlations = df2.corrwith(df2['residual']).drop('residual')

# Find the 3 shifts with the largest absolute correlation
top_index = signed_correlations.abs().nlargest(3).index

# Keep the signed correlation values for those 3 shifts
top_abs_correlations = signed_correlations.loc[top_index]
top_abs_correlations
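
Ranking by absolute value while keeping the sign can be illustrated on a small set of made-up correlation coefficients:

```python
import pandas as pd

# Hypothetical correlation coefficients for four shifts
corr = pd.Series({"shift_1": 0.10, "shift_2": -0.60, "shift_3": 0.40, "shift_4": -0.05})

# Rank by magnitude, then look up the original (signed) values
top = corr.loc[corr.abs().nlargest(2).index]
print(top)
```

Here `shift_2` ranks first even though its correlation is negative, because a strong negative relationship is just as informative for a linear model as a strong positive one.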

<div class="alert alert-block alert-success">
<b>❓Question:</b><br>
    Take any 3 shifts from your DataFrame.
</div>

In [ ]:
shifts = pd.DataFrame()

shifts['shift_1'] = df2_q['residual'].shift(1)
shifts['shift_2'] = df2_q['residual'].shift(2)
shifts['shift_3'] = df2_q['residual'].shift(3)

shifts = shifts.dropna()
shifts

<details><summary>Click here to see the answer</summary>
    shifts = pd.DataFrame() 

    shifts['shift_1'] = df2_q['residual'].shift(1) 
    shifts['shift_2'] = df2_q['residual'].shift(2)
    shifts['shift_3'] = df2_q['residual'].shift(3)

    shifts = shifts.dropna()
    shifts
</details>

### Building a Linear Regression Model

We will use the `LinearRegression` class from the Scikit-learn library to build a linear regression model using the best correlated shifts as features. We will split the data into training and testing sets, fit the model, and evaluate its performance using the mean squared error and R-squared score.

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Create a new DataFrame with the top 3 best correlated shifts as features
df3 = df2[top_abs_correlations.index].dropna()

# Set the target variable as the 'residual' column
target = df2.loc[df3.index, 'residual']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df3, target, test_size=0.2, shuffle = False)

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)

# Calculate the mean squared error and R-squared score
mse = mean_squared_error(y_test, y_pred_test)
r2 = r2_score(y_test, y_pred_test)

print("Mean Squared Error:", mse)
print("R-squared Score:", r2)
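
For reference, both metrics can be computed by hand. The sketch below uses invented actual and predicted values: the mean squared error is the average squared difference, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

y_true = np.array([3.0, -1.0, 2.0, 4.0])   # hypothetical actual values
y_hat = np.array([2.0, 0.0, 2.0, 5.0])     # hypothetical predictions

# Mean squared error: average of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 - (unexplained variation / total variation)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual)  # 0.75
print(r2_manual)
```

An R-squared close to 1 means the model explains most of the variation in the target, while a value near 0 (or below) means it does no better than predicting the mean.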

In [ ]:
y_test = pd.DataFrame(y_test)
y_train = pd.DataFrame(y_train)
y_test["pred"] = y_pred_test
y_train["pred"] = y_pred_train

In [ ]:
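width = 12
height = 10
plt.figure(figsize=(width, height))

# Plot actual and predicted values separately so each line gets its own legend entry
plt.plot(y_train['residual'], label='training data (actual)')
plt.plot(y_train['pred'], label='training data (predicted)')
plt.plot(y_test['residual'], label='validation data (actual)')
plt.plot(y_test['pred'], label='validation data (predicted)')
plt.xlabel('Time')
plt.ylabel('Residual')
plt.legend()
plt.show()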

<a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/share-notebooks.html/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkDA0101ENSkillsNetwork20235326-2021-01-01"> CLICK HERE</a> to see how to share your notebook


# **Thank you for completing this lab!**

## Author

<a href="https://author.skills.network/instructors/ostap_liashenyk" target="_blank" >Ostap Liashenyk</a>

<a href="https://author.skills.network/instructors/yaroslav_vyklyuk_2?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Yaroslav Vyklyuk, DrSc, PhD</a>

<a href="https://author.skills.network/instructors/mariya_fleychuk?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkGuidedProjectsIBMSkillsNetworkGPXX0QGDEN2306-2023-01-01">Prof. Mariya Fleychuk, DrSc, PhD</a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By      | Change Description                                         |
| ----------------- | ------- | ----------------| ---------------------------------------------------------- |
|     2023-04-01    |   1.0   | Ostap Liashenyk | Creation of the lab                                        |

<hr>

## <h3 align="center"> © IBM Corporation 2023. All rights reserved. </h3>
