# Data Processing Reference for a Single Stock - AARTIIND

This notebook processes stock price data and creates the following features:
- `rolling_avg_10`: 10-minute rolling average of close price
- `volume_sum_10`: Total volume traded over last 10 minutes
- `target`: Binary indicator: 1 if the close price is higher 5 bars (nominally 5 minutes) later, 0 if it is lower or unchanged

In [73]:
import pandas as pd
import numpy as np

## 1. Load Data

In [74]:

# Load minute-level price and volume data for AARTIIND from a local CSV file
df = pd.read_csv("./AARTIIND__EQ__NSE__NSE__MINUTE.csv")

# To load a different file, change the path:
# df = pd.read_csv('your_data.csv')

print("Data loaded successfully!")
print(f"Shape: {df.shape}")
df.head()

Data loaded successfully!
Shape: (370458, 6)


,timestamp,open,high,low,close,volume
0,2017-01-02 09:15:00+05:30,340.0,340.0,340.0,340.0,11.0
1,2017-01-02 09:16:00+05:30,340.0,340.0,340.0,340.0,0.0
2,2017-01-02 09:17:00+05:30,340.0,340.0,340.0,340.0,0.0
3,2017-01-02 09:18:00+05:30,340.0,343.7,340.0,343.7,1.0
4,2017-01-02 09:19:00+05:30,343.7,343.7,343.7,343.7,1.0


## 2. Data Preprocessing

In [75]:
# Convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['stock_name'] = "AARTIIND"

# Sort by timestamp (important for time series operations)
df = df.sort_values('timestamp').reset_index(drop=True)

# Set timestamp as index for easier time-based operations
df.set_index('timestamp', inplace=True)

print("Data preprocessed!")
print(f"Date range: {df.index.min()} to {df.index.max()}")
df.head()

Data preprocessed!
Date range: 2017-01-02 09:15:00+05:30 to 2021-01-01 15:29:00+05:30


timestamp,open,high,low,close,volume,stock_name
2017-01-02 09:15:00+05:30,340.0,340.0,340.0,340.0,11.0,AARTIIND
2017-01-02 09:16:00+05:30,340.0,340.0,340.0,340.0,0.0,AARTIIND
2017-01-02 09:17:00+05:30,340.0,340.0,340.0,340.0,0.0,AARTIIND
2017-01-02 09:18:00+05:30,340.0,343.7,340.0,343.7,1.0,AARTIIND
2017-01-02 09:19:00+05:30,343.7,343.7,343.7,343.7,1.0,AARTIIND


In [76]:
# Fill missing values with prior values (forward fill)
df.ffill(inplace=True)

# Check for missing values after filling
print("Missing values after filling:")
print(df.isnull().sum())


Missing values after filling:
open          0
high          0
low           0
close         0
volume        0
stock_name    0
dtype: int64
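
Forward fill replaces NaN values in rows that exist; it does not create rows for minutes that are absent from the file. Whether such gaps occur in this dataset is not shown above. The following is a minimal sketch for finding them, run on a hypothetical four-row frame (`toy`, `step`, `new_day`, and `intraday_gaps` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical bars: 09:17 and 09:18 are missing, then a new day starts
idx = pd.to_datetime([
    "2017-01-02 09:15:00+05:30",
    "2017-01-02 09:16:00+05:30",
    "2017-01-02 09:19:00+05:30",
    "2017-01-03 09:15:00+05:30",
])
toy = pd.DataFrame({"close": [340.0, 340.0, 343.7, 344.0]}, index=idx)

ts = toy.index.to_series()
step = ts.diff()                                        # time since the previous bar
new_day = ts.dt.normalize() != ts.shift().dt.normalize()  # True on the first bar of a day

# Gaps longer than one minute that are not simply the overnight break
intraday_gaps = step[~new_day & (step > pd.Timedelta(minutes=1))]
print(intraday_gaps)
```

Each entry of `intraday_gaps` is indexed by the bar that follows a gap, and its value is the elapsed time since the previous bar.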


## 3. Feature Engineering

In [77]:
# Feature 1: 10-minute rolling average of close price
# window='10min' is a time-based window: the 10 minutes up to and including each row
df['rolling_avg_10'] = df['close'].rolling(window='10min', min_periods=1).mean()

# Feature 2: Total volume traded over last 10 minutes
df['volume_sum_10'] = df['volume'].rolling(window='10min', min_periods=1).sum()

print("Rolling features created!")
df[['close', 'volume', 'rolling_avg_10', 'volume_sum_10']].head(5)

Rolling features created!


timestamp,close,volume,rolling_avg_10,volume_sum_10
2017-01-02 09:15:00+05:30,340.0,11.0,340.0,11.0
2017-01-02 09:16:00+05:30,340.0,0.0,340.0,11.0
2017-01-02 09:17:00+05:30,340.0,0.0,340.0,11.0
2017-01-02 09:18:00+05:30,343.7,1.0,340.925,12.0
2017-01-02 09:19:00+05:30,343.7,1.0,341.48,13.0
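
As a check on the output above, the first five rows can be reproduced on a toy frame. This is a minimal sketch: `toy` and `ones` are illustrative names, and the index is timezone-naive for brevity. The second part shows that a `'10min'` window never holds more than ten one-minute bars, because by default a time-based window covers the interval (t - 10 min, t].

```python
import pandas as pd

# First five bars from the output above (timezone dropped for brevity)
idx = pd.date_range("2017-01-02 09:15", periods=5, freq="min")
toy = pd.DataFrame(
    {
        "close": [340.0, 340.0, 340.0, 343.7, 343.7],
        "volume": [11.0, 0.0, 0.0, 1.0, 1.0],
    },
    index=idx,
)

# Same calls as in the notebook cell
toy["rolling_avg_10"] = toy["close"].rolling("10min", min_periods=1).mean()
toy["volume_sum_10"] = toy["volume"].rolling("10min", min_periods=1).sum()
print(toy)

# Twelve contiguous one-minute bars, each with value 1:
# the rolling sum counts how many bars fall inside each window
ones = pd.Series(1.0, index=pd.date_range("2017-01-02 09:15", periods=12, freq="min"))
window_sizes = ones.rolling("10min", min_periods=1).sum()
print(window_sizes.max())
```

With `min_periods=1`, the early rows average over fewer than ten bars, which is why the fourth row is (3 × 340.0 + 343.7) / 4 = 340.925.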


In [78]:
# Remove rows where rolling_avg_10 or volume_sum_10 is NaN
# (with min_periods=1 neither column contains NaN, so no rows are dropped here)
df.dropna(subset=['rolling_avg_10', 'volume_sum_10'], inplace=True)

print("Removed rows with NaN in rolling_avg_10 or volume_sum_10.")
print(f"New shape: {df.shape}")

Removed rows with NaN in rolling_avg_10 or volume_sum_10.
New shape: (370458, 8)
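
The shape is unchanged because `min_periods=1` yields a value from the very first bar. If only full windows are wanted, one option (an assumption, not what this notebook does) is to raise `min_periods` to 10, so that rows with fewer than ten bars in the window become NaN and can then be dropped. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Twelve contiguous one-minute bars with made-up prices
idx = pd.date_range("2017-01-02 09:15", periods=12, freq="min")
toy = pd.DataFrame({"close": [340.0 + i for i in range(12)]}, index=idx)

# Require ten observations inside the 10-minute window
toy["rolling_avg_10"] = toy["close"].rolling("10min", min_periods=10).mean()

# Now dropna actually removes the warm-up rows
full_windows = toy.dropna(subset=["rolling_avg_10"])
print(len(toy), len(full_windows))
```

The same setting would also blank out rows that follow a gap in the data, which may or may not be desirable.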


In [79]:
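# Feature 3: Target variable - does stock go up in next 5 minutes?
# shift(-5) looks 5 rows ahead, which equals 5 minutes only where the
# minute bars are contiguous (not across gaps or day boundaries)
df['close_5min_future'] = df['close'].shift(-5)

# Create binary target: 1 if price goes up, 0 if it goes down or stays same
# Note: the last 5 rows have no future price; NaN > close is False, so they also get 0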
df['target'] = (df['close_5min_future'] > df['close']).astype(int)

print("Target variable created!")
print("\nTarget distribution:")
print(df['target'].value_counts())
print(f"\nTarget ratio (up/total): {df['target'].mean():.2%}")


Target variable created!

Target distribution:
target
0    207960
1    162498
Name: count, dtype: int64

Target ratio (up/total): 43.86%
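
Because `shift(-5)` moves by rows, the last bars of one day are paired with the first bars of the next. One alternative, sketched here under the assumption that a strict 5-minute horizon is wanted, is to look up the close at exactly `timestamp + 5 minutes` and to label only rows where that bar exists. The data and the names `toy`, `row_based`, `future_index`, and `labelled` are hypothetical.

```python
import pandas as pd

# Toy bars spanning the last bars of one day and the first bars of the next
idx = pd.to_datetime([
    "2017-01-02 15:24", "2017-01-02 15:25", "2017-01-02 15:26",
    "2017-01-02 15:27", "2017-01-02 15:28", "2017-01-02 15:29",
    "2017-01-03 09:15", "2017-01-03 09:16",
])
toy = pd.DataFrame(
    {"close": [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 110.0, 111.0]},
    index=idx,
)

# Row-based look-ahead: pairs 15:25 with the next day's 09:15 bar
row_based = toy["close"].shift(-5)

# Time-based look-ahead: the close at exactly timestamp + 5 minutes, else NaN
future_index = toy.index + pd.Timedelta(minutes=5)
toy["close_5min_future"] = toy["close"].reindex(future_index).to_numpy()

# Label only rows that actually have a future price
labelled = toy.dropna(subset=["close_5min_future"]).copy()
labelled["target"] = (labelled["close_5min_future"] > labelled["close"]).astype(int)

print(row_based.notna().sum(), len(labelled))
```

Dropping unlabelled rows before the comparison also avoids the silent conversion of a missing future price into a `0` label.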


## 4. Final Dataset

In [80]:
# Drop rows where target is NaN
# (a no-op here: astype(int) in the previous cell left no NaN in target)
df_clean = df.dropna(subset=['target']).copy()

# Take the last 20 observations for the test set
# (the final 5 of these have NaN in close_5min_future)
test_df = df_clean.tail(20).copy()

# Remove the test observations from the training set
df_clean = df_clean.iloc[:-20].copy()


print("\nTest dataset (last 20 observations):")
print(f"Shape: {test_df.shape}")
print("\nClean dataset (removed last 20 observations for test):")
print(f"Shape: {df_clean.shape}")


Test dataset (last 20 observations):
Shape: (20, 10)

Clean dataset (removed last 20 observations for test):
Shape: (370438, 10)


## 5. Export Processed Data

In [81]:
# Save processed data to CSV
output_file = 'train.csv'
df_clean.to_csv(output_file)
print(f"Processed data saved to: {output_file}")

# Save the test dataset
test_output_file = 'test.csv'
test_df.to_csv(test_output_file)
print(f"Test data saved to: {test_output_file}")

# Display info about the saved files
print(f"\nColumns in processed output file: {list(df_clean.columns)}")
print(f"Total rows in processed data: {len(df_clean)}")
print(f"\nColumns in test output file: {list(test_df.columns)}")
print(f"Total rows in test data: {len(test_df)}")

Processed data saved to: train.csv
Test data saved to: test.csv

Columns in processed output file: ['open', 'high', 'low', 'close', 'volume', 'stock_name', 'rolling_avg_10', 'volume_sum_10', 'close_5min_future', 'target']
Total rows in processed data: 370438

Columns in test output file: ['open', 'high', 'low', 'close', 'volume', 'stock_name', 'rolling_avg_10', 'volume_sum_10', 'close_5min_future', 'target']
Total rows in test data: 20
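
`to_csv` writes the `timestamp` index as text, so it has to be parsed again when the files are read back. The following is a minimal sketch of that round trip, using an in-memory buffer and a hypothetical two-row frame in place of the real `train.csv` (`toy`, `buffer`, and `restored` are illustrative names):

```python
import io

import pandas as pd

# Hypothetical stand-in for the processed frame
idx = pd.to_datetime(["2017-01-02 09:15:00+05:30", "2017-01-02 09:16:00+05:30"])
toy = pd.DataFrame({"close": [340.0, 340.0], "target": [0, 1]}, index=idx)
toy.index.name = "timestamp"

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read back, restoring the timestamp column as a datetime index
restored = pd.read_csv(buffer, index_col="timestamp")
restored.index = pd.to_datetime(restored.index)
print(restored)
```

For the real files, the buffer would be replaced by the path `'train.csv'` or `'test.csv'`.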
