# Accessing Indian Market Data for Machine Learning Trading Strategies

This notebook shows how to adapt the machine-learning-for-trading techniques in this repository to Indian stock markets. We download data for NSE-listed stocks and apply the same feature engineering and modeling approaches used throughout the book.

## Setup & Installation

**Note:** Before running this notebook, ensure you have the required packages installed:

```bash
pip install yfinance pandas numpy matplotlib seaborn scikit-learn
```

If you're using the conda environment from this repository's installation instructions, these packages should already be available.

## Imports & Settings

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

plt.style.use('seaborn-v0_8')
sns.set_palette('husl')

## Indian Stock Universe

Let's define a universe of popular Indian stocks from different sectors:

In [None]:
# Popular Indian stocks with NSE symbols
indian_stocks = {
    # Technology
    'TCS.NS': 'Tata Consultancy Services',
    'INFY.NS': 'Infosys',
    'WIPRO.NS': 'Wipro',
    'TECHM.NS': 'Tech Mahindra',
    
    # Financial Services
    'HDFCBANK.NS': 'HDFC Bank',
    'ICICIBANK.NS': 'ICICI Bank',
    'SBIN.NS': 'State Bank of India',
    'KOTAKBANK.NS': 'Kotak Mahindra Bank',
    
    # Energy & Oil
    'RELIANCE.NS': 'Reliance Industries',
    'ONGC.NS': 'Oil and Natural Gas Corporation',
    'BPCL.NS': 'Bharat Petroleum',
    
    # Pharmaceuticals
    'SUNPHARMA.NS': 'Sun Pharmaceutical',
    'DRREDDY.NS': 'Dr. Reddy\'s Laboratories',
    'CIPLA.NS': 'Cipla',
    
    # Consumer Goods
    'HINDUNILVR.NS': 'Hindustan Unilever',
    'ITC.NS': 'ITC Limited',
    'NESTLEIND.NS': 'Nestle India',
    
    # Automotive
    'MARUTI.NS': 'Maruti Suzuki',
    'TATAMOTORS.NS': 'Tata Motors',
    'M&M.NS': 'Mahindra & Mahindra'
}

symbols = list(indian_stocks.keys())
print(f"Indian stock universe: {len(symbols)} stocks")
for symbol, name in list(indian_stocks.items())[:5]:
    print(f"{symbol}: {name}")
print("...")

## Downloading Indian Stock Data

In [None]:
# Define date range for historical data
end_date = datetime.now()
start_date = end_date - timedelta(days=365*2)  # 2 years of data

print(f"Downloading data from {start_date.date()} to {end_date.date()}")

In [None]:
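# Download data for all Indian stocks.
# auto_adjust is set explicitly because its default has differed across yfinance
# releases; with True, 'Close' is adjusted for splits and dividends.
data = yf.download(symbols, start=start_date, end=end_date, auto_adjust=True, progress=True)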

print(f"Downloaded data shape: {data.shape}")
print(f"Date range: {data.index[0]} to {data.index[-1]}")
print(f"Columns: {data.columns.names}")
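
With several tickers, `yf.download` returns a frame whose columns form a two-level MultiIndex of (field, ticker), which is why `data['Close']` yields one column per stock. The sketch below builds a small synthetic frame with that layout and reshapes it into a long (date, ticker) table. The level names `Price` and `Ticker` and all values are assumptions for illustration; no network call is made.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the output of yf.download with two tickers.
dates = pd.date_range("2024-01-01", periods=3, freq="B", name="Date")
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["INFY.NS", "TCS.NS"]], names=["Price", "Ticker"]
)
wide = pd.DataFrame(
    np.arange(12, dtype=float).reshape(3, 4), index=dates, columns=columns
)

# Move the ticker level from the columns into the row index.
long_df = wide.stack(level="Ticker")
long_df.columns.name = None
print(long_df)
```

The long layout has one row per (date, ticker) pair, which is convenient for per-stock `groupby` operations.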

In [None]:
# Check data quality
print("Data availability by stock:")
availability = data['Close'].count().sort_values(ascending=False)
print(availability.head(10))

# Plot data availability
plt.figure(figsize=(12, 6))
availability.plot(kind='bar')
plt.title('Data Availability by Stock (Number of Trading Days)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Market Index Data

In [None]:
# Download major Indian market indices
indices = {
    '^NSEI': 'NIFTY 50',
    '^BSESN': 'SENSEX',
    '^NSEBANK': 'NIFTY BANK'
}

index_data = yf.download(list(indices.keys()), start=start_date, end=end_date)

# Plot major indices
plt.figure(figsize=(12, 8))
for i, (symbol, name) in enumerate(indices.items()):
    plt.subplot(2, 2, i+1)
    index_prices = index_data['Close'][symbol].dropna()
    plt.plot(index_prices.index, index_prices.values)
    plt.title(f'{name} ({symbol})')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Feature Engineering for Indian Stocks

Now we'll apply the same feature engineering techniques used throughout the book to Indian stock data:

In [None]:
# Focus on a few liquid stocks for feature engineering
liquid_stocks = ['RELIANCE.NS', 'TCS.NS', 'HDFCBANK.NS', 'INFY.NS', 'ITC.NS']

# Extract OHLCV data
ohlcv_data = {}
for stock in liquid_stocks:
    stock_data = pd.DataFrame()
    stock_data['open'] = data['Open'][stock]
    stock_data['high'] = data['High'][stock]
    stock_data['low'] = data['Low'][stock]
    stock_data['close'] = data['Close'][stock]
    stock_data['volume'] = data['Volume'][stock]
    
    # Remove any rows with missing data
    stock_data = stock_data.dropna()
    ohlcv_data[stock] = stock_data

print(f"Prepared OHLCV data for {len(liquid_stocks)} stocks")
print(f"Sample data shape for {liquid_stocks[0]}: {ohlcv_data[liquid_stocks[0]].shape}")

In [None]:
def calculate_technical_features(df):
    """Calculate technical indicators for Indian stocks"""
    result = df.copy()
    
    # Price-based features
    result['returns'] = df['close'].pct_change()
    result['log_returns'] = np.log(df['close'] / df['close'].shift(1))
    
    # Moving averages
    result['sma_5'] = df['close'].rolling(5).mean()
    result['sma_20'] = df['close'].rolling(20).mean()
    result['sma_50'] = df['close'].rolling(50).mean()
    
    # Price ratios
    result['price_to_sma20'] = df['close'] / result['sma_20']
    result['sma5_to_sma20'] = result['sma_5'] / result['sma_20']
    
    # Volatility
    result['volatility_20'] = result['returns'].rolling(20).std()
    
    # Volume features
    result['volume_ma_20'] = df['volume'].rolling(20).mean()
    result['volume_ratio'] = df['volume'] / result['volume_ma_20']
    
    # Price range features
    result['high_low_ratio'] = df['high'] / df['low']
    result['close_to_high'] = df['close'] / df['high']
    result['close_to_low'] = df['close'] / df['low']
    
    # Momentum indicators
    result['rsi'] = calculate_rsi(df['close'], 14)
    result['momentum_10'] = df['close'] / df['close'].shift(10) - 1
    
    return result

def calculate_rsi(prices, window=14):
    """Calculate Relative Strength Index"""
    delta = prices.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    rsi = 100 - (100 / (1 + rs))
    return rsi
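
`calculate_rsi` averages gains and losses with a simple rolling mean. Many charting packages instead use Wilder's smoothing, an exponential average with `alpha = 1 / window`, so their RSI values can differ from the ones computed here. A minimal sketch of that variant on synthetic prices follows; the name `calculate_rsi_wilder` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def calculate_rsi_wilder(prices, window=14):
    """RSI with Wilder's smoothing (hypothetical helper, for comparison only)."""
    delta = prices.diff()
    gain = delta.clip(lower=0)       # positive moves, zero otherwise
    loss = -delta.clip(upper=0)      # magnitude of negative moves
    # Wilder's smoothing is an exponential average with alpha = 1 / window.
    avg_gain = gain.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / window, min_periods=window, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Synthetic random-walk prices, for illustration only.
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 250).cumsum())
rsi_wilder = calculate_rsi_wilder(prices)
print(rsi_wilder.tail())
```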

In [None]:
# Calculate features for all liquid stocks
enhanced_data = {}
for stock in liquid_stocks:
    enhanced_data[stock] = calculate_technical_features(ohlcv_data[stock])
    print(f"Features calculated for {stock}: {enhanced_data[stock].shape[1]} features")

# Display sample features for one stock
sample_stock = liquid_stocks[0]
print(f"\nSample features for {sample_stock}:")
print(enhanced_data[sample_stock].columns.tolist())
print(f"\nLast 5 rows of data:")
print(enhanced_data[sample_stock][['close', 'returns', 'sma_20', 'rsi', 'volatility_20']].tail())

## Market Analysis: Indian Stock Characteristics

In [None]:
# Analyze correlation between stocks
returns_data = pd.DataFrame()
for stock in liquid_stocks:
    returns_data[stock] = enhanced_data[stock]['returns']

correlation_matrix = returns_data.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
           square=True, linewidths=0.5)
plt.title('Correlation Matrix: Indian Stock Returns')
plt.tight_layout()
plt.show()

print("Average correlation between stocks:", correlation_matrix.values[np.triu_indices_from(correlation_matrix.values, k=1)].mean())

In [None]:
# Compare volatility patterns
plt.figure(figsize=(12, 8))
for i, stock in enumerate(liquid_stocks):
    plt.subplot(2, 3, i+1)
    volatility = enhanced_data[stock]['volatility_20'].dropna()
    plt.plot(volatility.index, volatility.values)
    plt.title(f'20-day Rolling Volatility: {stock}')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Preparing Data for Machine Learning

Now we'll prepare the data in the same format used throughout the book for ML models:

In [None]:
def prepare_ml_dataset(stock_data_dict, target_days=5):
    """Prepare dataset for ML prediction of Indian stocks"""
    ml_data = []
    
    for stock_symbol, data in stock_data_dict.items():
        # Calculate forward returns as target
        data = data.copy()
        data['target'] = data['close'].shift(-target_days) / data['close'] - 1
        data['symbol'] = stock_symbol.replace('.NS', '')  # Clean symbol name
        
        # Select features (exclude non-numeric and target-related columns)
        feature_cols = ['returns', 'sma_5', 'sma_20', 'sma_50', 'price_to_sma20', 
                       'sma5_to_sma20', 'volatility_20', 'volume_ratio', 'high_low_ratio',
                       'close_to_high', 'close_to_low', 'rsi', 'momentum_10']
        
        # Create feature dataframe
        features = data[feature_cols + ['target', 'symbol']].dropna()
        features['date'] = features.index
        
        ml_data.append(features)
    
    # Combine all stocks
    combined_data = pd.concat(ml_data, ignore_index=True)
    return combined_data

# Prepare ML dataset
ml_dataset = prepare_ml_dataset(enhanced_data, target_days=5)
print(f"ML dataset shape: {ml_dataset.shape}")
print(f"Features: {[col for col in ml_dataset.columns if col not in ['target', 'symbol', 'date']]}")
print(f"\nTarget distribution (5-day forward returns):")
print(ml_dataset['target'].describe())

In [None]:
# Visualize target distribution
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
ml_dataset['target'].hist(bins=50, alpha=0.7)
plt.title('Distribution of 5-day Forward Returns')
plt.xlabel('Returns')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
ml_dataset.boxplot(column='target', by='symbol', ax=plt.gca())
plt.suptitle('')  # drop the automatic "Boxplot grouped by symbol" figure title
plt.title('Forward Returns by Stock')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Sample Machine Learning Application

Here's a simple example of how to apply the ML techniques from the book to Indian stock data:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Prepare features and target
feature_cols = [col for col in ml_dataset.columns if col not in ['target', 'symbol', 'date']]
X = ml_dataset[feature_cols]
y = ml_dataset['target']

# Remove any remaining NaN values
mask = ~(X.isna().any(axis=1) | y.isna())
X_clean = X[mask]
y_clean = y[mask]

print(f"Clean dataset shape: {X_clean.shape}")
print(f"Features used: {feature_cols}")

In [None]:
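# Split data chronologically (important for time series).
# ml_dataset is concatenated stock by stock, so order the rows by date first;
# otherwise the last 20% would simply be the last stock in the list.
order = ml_dataset.loc[X_clean.index, 'date'].sort_values(kind='stable').index
X_clean, y_clean = X_clean.loc[order], y_clean.loc[order]

# Use first 80% for training, last 20% for testing
split_idx = int(0.8 * len(X_clean))
X_train, X_test = X_clean.iloc[:split_idx], X_clean.iloc[split_idx:]
y_train, y_test = y_clean.iloc[:split_idx], y_clean.iloc[split_idx:]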

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

# Scale features (tree ensembles do not need this; kept so other estimators can be swapped in)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred = rf_model.predict(X_test_scaled)

# Evaluate performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"\nModel Performance on Indian Stocks:")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"Mean absolute actual return: {np.abs(y_test).mean():.4f}")
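
Because each target is a 5-day forward return, the labels of the last training rows reach into the test period. A common remedy is to leave a gap of at least `target_days` trading dates between the two sets. The sketch below illustrates this on a synthetic panel; `purged_date_split` is a hypothetical helper, and the 80/20 fraction simply mirrors the split used in this notebook.

```python
import numpy as np
import pandas as pd

def purged_date_split(df, date_col="date", train_frac=0.8, gap=5):
    """Split a panel by date, dropping `gap` dates between train and test."""
    dates = pd.Index(df[date_col].unique()).sort_values()
    cut = int(train_frac * len(dates))
    train_dates = dates[: cut - gap]   # labels of these rows end before the test period
    test_dates = dates[cut:]
    train = df[df[date_col].isin(train_dates)]
    test = df[df[date_col].isin(test_dates)]
    return train, test

# Synthetic panel: 2 symbols x 100 business days, for illustration only.
days = pd.date_range("2024-01-01", periods=100, freq="B")
panel = pd.DataFrame({
    "date": np.tile(days.values, 2),
    "symbol": np.repeat(["AAA", "BBB"], len(days)),
    "target": np.random.default_rng(1).normal(size=2 * len(days)),
})
train, test = purged_date_split(panel, gap=5)
print(train["date"].max(), test["date"].min())
```

Splitting on unique dates rather than on row positions also keeps all stocks for a given day on the same side of the split.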

In [None]:
# Feature importance analysis
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance for Indian Stock Prediction')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

print("Top 5 most important features:")
print(feature_importance.head())

In [None]:
# Prediction vs actual scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Returns')
plt.ylabel('Predicted Returns')
plt.title('Prediction vs Actual: Indian Stock Returns')
plt.tight_layout()
plt.show()

# Calculate correlation between predictions and actual
correlation = np.corrcoef(y_test, y_pred)[0, 1]
print(f"Correlation between predicted and actual returns: {correlation:.3f}")
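
A single pooled correlation mixes variation across dates and across stocks. A common complement is the information coefficient (IC): the Spearman rank correlation between predictions and realized returns, computed per date and then averaged. The sketch below uses synthetic data and a hypothetical helper named `daily_rank_ic`; ranks are correlated with Pearson's formula, which is equivalent to Spearman's correlation.

```python
import numpy as np
import pandas as pd

def daily_rank_ic(df, pred_col="pred", target_col="target", date_col="date"):
    """Spearman rank correlation between predictions and outcomes, per date."""
    def _ic(group):
        return group[pred_col].rank().corr(group[target_col].rank())
    return df.groupby(date_col)[[pred_col, target_col]].apply(_ic)

# Synthetic example: 20 dates x 5 stocks, predictions weakly related to targets.
rng = np.random.default_rng(42)
n_dates, n_stocks = 20, 5
target = rng.normal(size=n_dates * n_stocks)
frame = pd.DataFrame({
    "date": np.repeat(pd.date_range("2024-01-01", periods=n_dates, freq="B"), n_stocks),
    "target": target,
    "pred": 0.5 * target + rng.normal(size=n_dates * n_stocks),
})
ic = daily_rank_ic(frame)
print(f"Mean IC: {ic.mean():.3f} over {len(ic)} dates")
```

In this notebook the inputs would be `y_pred`, `y_test`, and `ml_dataset.loc[X_test.index, 'date']`. With only five stocks per date the daily IC is very noisy, so a larger universe is needed before reading much into it.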

## Next Steps for Indian Market Trading Strategies

This notebook demonstrates the basic adaptation of ML4T techniques to Indian markets. To build a complete trading strategy, consider:

### 1. Enhanced Feature Engineering
- **Sector-specific features**: Indian market sectors (IT, Pharma, Banking) have unique characteristics
- **Macro indicators**: Include Indian economic indicators (repo rate, inflation, etc.)
- **Currency features**: USD/INR exchange rate affects export-heavy stocks
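
As a sketch of the currency idea above, the snippet below derives USD/INR return features and lags them by one day, so that a row never uses same-day FX information, before joining them onto a stock's price frame. Both series are synthetic; in practice the exchange rate could come from `yf.download`, where the Yahoo symbol `INR=X` is an assumption to verify.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.date_range("2024-01-01", periods=30, freq="B")

# Synthetic stand-ins for a USD/INR series and one stock's closing prices.
usdinr = pd.Series(83 + rng.normal(0, 0.1, len(dates)).cumsum(), index=dates, name="usdinr")
stock = pd.DataFrame({"close": 1500 + rng.normal(0, 10, len(dates)).cumsum()}, index=dates)

fx_features = pd.DataFrame({
    "usdinr_ret_1d": usdinr.pct_change(),
    "usdinr_ret_5d": usdinr.pct_change(5),
}).shift(1)  # conservative one-day lag to avoid look-ahead

# A left join keeps the stock's own trading calendar.
stock_with_fx = stock.join(fx_features, how="left")
print(stock_with_fx.tail())
```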

### 2. Indian Market-Specific Considerations
- **Impact cost**: Typically higher than in US markets, especially for small- and mid-cap stocks
- **Settlement & clearing**: T+1 settlement cycle for equities (India completed its move from T+2 in January 2023)
- **Market timing**: IST 9:15 AM - 3:30 PM trading window
- **Regulatory framework**: SEBI regulations and disclosure requirements
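
Daily bars need no session handling, but intraday data does. A minimal sketch, assuming timestamps arrive in UTC, that converts them to IST and keeps only the 9:15 AM to 3:30 PM window (the bars are synthetic):

```python
import pandas as pd

# Synthetic 15-minute bars in UTC covering one full day (illustration only).
utc_index = pd.date_range("2024-01-02 00:00", "2024-01-02 23:45", freq="15min", tz="UTC")
bars = pd.DataFrame({"close": range(len(utc_index))}, index=utc_index)

# Convert to Indian Standard Time (UTC+5:30) and keep the regular session.
# between_time includes both endpoints by default.
bars_ist = bars.tz_convert("Asia/Kolkata")
session = bars_ist.between_time("09:15", "15:30")
print(session.index[0], session.index[-1], len(session))
```

Whether a bar stamped 15:30 belongs to the session depends on whether the data vendor labels bars by their start or end time, so check that convention before filtering real data.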

### 3. Advanced Data Sources
- **Corporate actions**: Bonus issues, stock splits, dividends
- **Fundamental data**: Annual reports, quarterly results
- **Alternative data**: News sentiment in Hindi/English, satellite data for infrastructure companies

### 4. Strategy Implementation
- **Backtesting**: Adapt Zipline or use Indian market backtesting frameworks
- **Execution**: Connect to Indian brokers' APIs (Zerodha, Angel One (formerly Angel Broking), etc.)
- **Risk management**: Consider Indian market-specific risks
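
Before moving to a full backtesting framework, a vectorized check can show whether a signal survives trading costs. The sketch below goes long or short according to the sign of the previous day's signal and charges a cost proportional to turnover. The function name `backtest_sign_strategy` and the 10 basis point cost are hypothetical placeholders, not estimates of actual Indian market costs.

```python
import numpy as np
import pandas as pd

def backtest_sign_strategy(returns, signal, cost_bps=10):
    """Daily net returns of a sign-based strategy with proportional costs."""
    # Trade on the bar after the signal is observed to avoid look-ahead.
    position = np.sign(signal).shift(1).fillna(0)
    # Turnover is the absolute change in position (2 when flipping long to short).
    turnover = position.diff().abs().fillna(position.abs())
    gross = position * returns
    return gross - turnover * cost_bps / 1e4

# Tiny deterministic example, for illustration only.
returns = pd.Series([0.01, 0.02, -0.01, 0.03])
signal = pd.Series([1.0, 1.0, -1.0, 1.0])
net = backtest_sign_strategy(returns, signal, cost_bps=10)
print(net.round(4).tolist())
```

In the example the position is flat on the first day, long on the next two, and short on the last, so the final day pays a double cost for the flip.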

### 5. Resources for Further Development
- **NSE/BSE historical data**: Official exchange data
- **Indian broker APIs**: Zerodha Kite, Angel One (formerly Angel Broking), etc.
- **Economic data**: Reserve Bank of India (RBI) and Ministry of Statistics and Programme Implementation (MoSPI) releases
- **News sources**: Economic Times, Business Standard APIs

In [None]:
# Save processed data for further analysis
# ml_dataset.to_csv('indian_stocks_ml_dataset.csv', index=False)
# print("Dataset saved as 'indian_stocks_ml_dataset.csv'")

print("\n=== Summary ===")
print(f"Successfully processed {len(liquid_stocks)} Indian stocks")
print(f"Generated {len(feature_cols)} technical features")
print(f"Built ML model with {correlation:.3f} prediction correlation")
print("\nThe workflow from this book runs end to end on Indian market data.")