## Codveda Technologies Internship  

### Task 1: Data Preprocessing for Machine Learning  

**Intern:** Muhammad Sakibur Rahaman  
  
**Dataset Used:** Stock Prices Data Set (`2) Stock Prices Data Set.csv`)  

#### Objective:  

- Handle missing values
- Convert categorical variables to numeric form
- Normalize or standardize numerical features
- Split the dataset into training and testing sets

---  

## Importing Dependencies  

We import the relevant Python libraries for data preprocessing.



In [32]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import os


## Step 1: Load the Dataset
We mount Google Drive, list the files in the task folder, and load the stock prices dataset into a DataFrame.


In [33]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
path = '/content/drive/MyDrive/Machine Learning Task List'
for file in os.listdir(path):
    print(file)


4) house Prediction Data Set.csv
churn-bigml-80.csv
churn-bigml-20.csv
3) Sentiment dataset.csv
1) iris.csv
2) Stock Prices Data Set.csv


In [35]:
df = pd.read_csv("/content/drive/MyDrive/Machine Learning Task List/2) Stock Prices Data Set.csv")
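As a small aside, `pd.read_csv` can parse the `date` column while loading, so the column arrives as a datetime type instead of `object`. The sketch below is only an illustration: it reads a tiny in-memory CSV (three rows copied from the previews later in this notebook) instead of the file on Drive, and the names `sample_csv` and `df_sample` are made up for the example.

```python
import io

import pandas as pd

# Illustrative stand-in for the CSV on Drive (three rows from the notebook's previews)
sample_csv = io.StringIO(
    "symbol,date,open,high,low,close,volume\n"
    "AAL,2014-01-02,25.07,25.82,25.06,25.36,8998943\n"
    "AAPL,2014-01-02,79.3828,79.5756,78.8601,79.0185,58791957\n"
    "AAP,2014-01-02,110.36,111.88,109.29,109.74,542711\n"
)

# parse_dates converts the listed columns to datetime64 while reading
df_sample = pd.read_csv(sample_csv, parse_dates=["date"])
print(df_sample.dtypes)
```

With the real file, the same `parse_dates=["date"]` argument could be passed alongside the Drive path.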


## Step 2: Explore Dataset
We explore the dataset to understand its structure, data types, and missing values.


In [36]:
# View dataset info
df.info()

# Check missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# View unique symbols (categorical variable)
print("\nUnique symbols:")
print(df['symbol'].unique())

# Basic stats
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497472 entries, 0 to 497471
Data columns (total 7 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   symbol  497472 non-null  object 
 1   date    497472 non-null  object 
 2   open    497461 non-null  float64
 3   high    497464 non-null  float64
 4   low     497464 non-null  float64
 5   close   497472 non-null  float64
 6   volume  497472 non-null  int64  
dtypes: float64(4), int64(1), object(2)
memory usage: 26.6+ MB

Missing values per column:
symbol     0
date       0
open      11
high       8
low        8
close      0
volume     0
dtype: int64

Unique symbols:
['AAL' 'AAPL' 'AAP' 'ABBV' 'ABC' 'ABT' 'ACN' 'ADBE' 'ADI' 'ADM' 'ADP'
 'ADSK' 'ADS' 'AEE' 'AEP' 'AES' 'AET' 'AFL' 'AGN' 'AIG' 'AIV' 'AIZ' 'AJG'
 'AKAM' 'ALB' 'ALGN' 'ALK' 'ALLE' 'ALL' 'ALXN' 'AMAT' 'AMD' 'AME' 'AMGN'
 'AMG' 'AMP' 'AMT' 'AMZN' 'ANDV' 'ANSS' 'ANTM' 'AON' 'AOS' 'APA' 'APC'
 'APD' 'APH' 'ARE' 'ARNC' 'ATVI' 'AVB' 'AVGO' 'A

| | open | high | low | close | volume |
|---|---|---|---|---|---|
| count | 497461.0 | 497464.0 | 497464.0 | 497472.0 | 497472.0 |
| mean | 86.352275 | 87.132562 | 85.552467 | 86.369082 | 4253611.0 |
| std | 101.471228 | 102.312062 | 100.570957 | 101.472407 | 8232139.0 |
| min | 1.62 | 1.69 | 1.5 | 1.59 | 0.0 |
| 25% | 41.69 | 42.09 | 41.28 | 41.70375 | 1080166.0 |
| 50% | 64.97 | 65.56 | 64.3537 | 64.98 | 2084896.0 |
| 75% | 98.41 | 99.23 | 97.58 | 98.42 | 4271928.0 |
| max | 2044.0 | 2067.99 | 2035.11 | 2049.0 | 618237600.0 |


### Summary:

- The dataset contains 497,472 rows and 7 columns.

- Columns include stock symbol, date, open, high, low, close prices, and volume.

- The dataset contains 11 missing values in 'open', 8 in 'high', and 8 in 'low'.

- 'symbol' and 'date' are stored as object (string) columns; the others are numeric.
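The counts in this summary can also be read as proportions: 11 missing values out of 497,472 rows is roughly 0.002% of the 'open' column. A minimal sketch of a combined report is shown below on a small made-up frame; `toy` and `missing_report` are illustrative names, and the inserted `NaN` positions are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with a couple of artificial gaps (not the real dataset)
toy = pd.DataFrame({
    "symbol": ["AAL", "AAPL", "AAP", "ABBV"],
    "open": [25.07, np.nan, 110.36, 52.12],
    "high": [25.82, 79.5756, np.nan, 52.33],
    "close": [25.36, 79.0185, 109.74, 51.98],
})

# One row per column: absolute count, percentage, and dtype
missing_report = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
    "dtype": toy.dtypes.astype(str),
})
print(missing_report)
```

Seeing the percentage next to the count makes it easier to judge whether imputing or simply dropping the affected rows is the lighter-touch option.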

## Step 3: Handle Missing Data
We fill missing values in the 'open', 'high', and 'low' columns with each column's median.


In [37]:
# Fill missing values in each price column with that column's median.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which pandas warns about and which stops working in pandas 3.0.
for col in ['open', 'high', 'low']:
    df[col] = df[col].fillna(df[col].median())

# Verify no missing values remain
df.isnull().sum()


| | 0 |
|---|---|
| symbol | 0 |
| date | 0 |
| open | 0 |
| high | 0 |
| low | 0 |
| close | 0 |
| volume | 0 |


In [38]:
# Check missing values
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
symbol    0
date      0
open      0
high      0
low       0
close     0
volume    0
dtype: int64
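One thing to keep in mind with a single column-wide median: prices in this dataset range from under 2 to over 2,000, so the overall median (about 65 for 'open') may be far from the typical price of the stock that has the gap. A possible alternative, sketched here on made-up numbers, is to fill each gap with the median of the same symbol. The names `toy`, `global_fill`, and `per_symbol_fill` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: two symbols trading at very different price levels
toy = pd.DataFrame({
    "symbol": ["AAL", "AAL", "AAL", "AAP", "AAP", "AAP"],
    "open": [25.0, np.nan, 27.0, 110.0, 112.0, np.nan],
})

# Approach used in the notebook: one median for the whole column
global_fill = toy["open"].fillna(toy["open"].median())

# Alternative: median computed within each symbol, aligned back to the rows
group_median = toy.groupby("symbol")["open"].transform("median")
per_symbol_fill = toy["open"].fillna(group_median)

print(pd.DataFrame({"global": global_fill, "per_symbol": per_symbol_fill}))
```

If a symbol had no observed values at all, its group median would itself be `NaN`, so a fallback to the global median would still be needed in that case. With only a few dozen missing values in nearly half a million rows, either choice is unlikely to change downstream results much.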


## Step 4: Encode Categorical Variables
We apply One-Hot Encoding to the 'symbol' categorical column.


In [39]:
# One-hot encode 'symbol' column
df_encoded = pd.get_dummies(df, columns=['symbol'])

# Preview new columns
df_encoded.head()


| | date | open | high | low | close | volume | symbol_A | symbol_AAL | symbol_AAP | symbol_AAPL | ... | symbol_XL | symbol_XLNX | symbol_XOM | symbol_XRAY | symbol_XRX | symbol_XYL | symbol_YUM | symbol_ZBH | symbol_ZION | symbol_ZTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-02 | 25.07 | 25.82 | 25.06 | 25.36 | 8998943 | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 2014-01-02 | 79.3828 | 79.5756 | 78.8601 | 79.0185 | 58791957 | False | False | False | True | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 2014-01-02 | 110.36 | 111.88 | 109.29 | 109.74 | 542711 | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 2014-01-02 | 52.12 | 52.33 | 51.52 | 51.98 | 4569061 | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 2014-01-02 | 70.11 | 70.23 | 69.48 | 69.89 | 1148391 | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
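Because there are hundreds of distinct symbols, one-hot encoding produces a very wide, mostly-`False` frame. Two options that may help with memory are sketched below on toy data: the `sparse=True` flag of `pd.get_dummies`, and plain integer category codes. The variable names and toy values are illustrative; integer codes impose an arbitrary ordering on the symbols, so they tend to suit tree-based models better than linear ones.

```python
import pandas as pd

# Toy data with three symbols (values are made up)
toy = pd.DataFrame({
    "symbol": ["AAL", "AAPL", "AAP", "AAL"],
    "close": [1.0, 2.0, 3.0, 4.0],
})

# Dense one-hot encoding, as in the notebook
dense_encoded = pd.get_dummies(toy, columns=["symbol"])

# Same columns, but stored as sparse arrays (only the True entries are kept)
sparse_encoded = pd.get_dummies(toy, columns=["symbol"], sparse=True)

# Alternative: a single integer column, one code per symbol (sorted order)
symbol_codes = toy["symbol"].astype("category").cat.codes

print(dense_encoded.columns.tolist())
print(symbol_codes.tolist())
```

Not every scikit-learn estimator accepts pandas sparse columns directly, so the dense encoding used in this notebook remains the simplest choice when memory allows.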


## Step 5: Process 'date' Column
We extract year, month, day from the 'date' column and drop the original.


In [40]:
# Convert 'date' to datetime
df_encoded['date'] = pd.to_datetime(df_encoded['date'])

# Extract year, month, day
df_encoded['year'] = df_encoded['date'].dt.year
df_encoded['month'] = df_encoded['date'].dt.month
df_encoded['day'] = df_encoded['date'].dt.day

# Drop original 'date' column
df_encoded.drop('date', axis=1, inplace=True)

df_encoded.head()


| | open | high | low | close | volume | symbol_A | symbol_AAL | symbol_AAP | symbol_AAPL | symbol_ABBV | ... | symbol_XRAY | symbol_XRX | symbol_XYL | symbol_YUM | symbol_ZBH | symbol_ZION | symbol_ZTS | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.07 | 25.82 | 25.06 | 25.36 | 8998943 | False | True | False | False | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 1 | 79.3828 | 79.5756 | 78.8601 | 79.0185 | 58791957 | False | False | False | True | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 2 | 110.36 | 111.88 | 109.29 | 109.74 | 542711 | False | False | True | False | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 3 | 52.12 | 52.33 | 51.52 | 51.98 | 4569061 | False | False | False | False | True | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 4 | 70.11 | 70.23 | 69.48 | 69.89 | 1148391 | False | False | False | False | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
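Year, month, and day are a reasonable starting point. Two further date features that are sometimes useful for market data are the day of the week and a cyclical encoding of the month, so that December and January end up close together. The sketch below is an optional extension on three sample dates, not something the notebook does; `dates` and `features` are illustrative names.

```python
import numpy as np
import pandas as pd

# Three sample dates (the first matches the notebook's preview rows)
dates = pd.to_datetime(pd.Series(["2014-01-02", "2014-06-30", "2014-12-31"]))

features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "dayofweek": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
})

# Map month 1..12 onto a circle so that 12 and 1 are neighbours
angle = 2 * np.pi * features["month"] / 12
features["month_sin"] = np.sin(angle)
features["month_cos"] = np.cos(angle)

print(features)
```

Whether these extra columns help depends on the model and the prediction target, so they are best treated as candidates to evaluate rather than defaults.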


## Step 6: Normalize Numerical Features
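We standardize the numeric columns with `StandardScaler` so that each has zero mean and unit variance. The scaled columns include the target, 'close'.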


In [41]:
# Scale the numerical columns (this list includes the target, 'close')
numeric_cols = ['open', 'high', 'low', 'close', 'volume']
scaler = StandardScaler()
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])

df_encoded.head()


| | open | high | low | close | volume | symbol_A | symbol_AAL | symbol_AAP | symbol_AAPL | symbol_ABBV | ... | symbol_XRAY | symbol_XRX | symbol_XYL | symbol_YUM | symbol_ZBH | symbol_ZION | symbol_ZTS | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.60394 | -0.599272 | -0.601492 | -0.601239 | 0.57644 | False | True | False | False | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 1 | -0.06868 | -0.073859 | -0.066541 | -0.072439 | 6.625058 | False | False | False | True | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 2 | 0.236604 | 0.241887 | 0.236033 | 0.230318 | -0.450782 | False | False | True | False | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 3 | -0.337359 | -0.34016 | -0.338392 | -0.338901 | 0.038319 | False | False | False | False | True | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
| 4 | -0.160065 | -0.165204 | -0.15981 | -0.1624 | -0.377207 | False | False | False | False | False | ... | False | False | False | False | False | False | False | 2014 | 1 | 2 |
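For reference, `StandardScaler` replaces each value x with (x - mean) / std, where the standard deviation is the population one (`ddof=0`). As a rough check against the numbers above, (25.07 - 86.352275) / 101.471228 is about -0.604, which matches the first scaled 'open' value; the `describe()` figures use the sample standard deviation and the pre-imputation mean, but with this many rows the difference is negligible. The sketch below confirms the formula on the five 'open' values from the preview; `x`, `scaled`, and `manual` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five 'open' values from the preview rows, as a single-column array
x = np.array([[25.07], [79.3828], [110.36], [52.12], [70.11]])

# Library result
scaled = StandardScaler().fit_transform(x)

# Manual result: subtract the column mean, divide by the population std (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```

Because only five rows are used here, the scaled values differ from those in the table above, which were computed from the full column.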


## Step 7: Split into Training and Testing Sets
We separate the features (X) from the target ('close') and hold out 20% of the rows for testing, using a fixed random seed so the split is reproducible.


In [42]:
# Separate features (X) from the target (y)
X = df_encoded.drop('close', axis=1)
y = df_encoded['close']

# Split into 80% training and 20% testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)


Training set shape: (397977, 512)
Testing set shape: (99495, 512)
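In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common way to avoid this form of leakage is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the toy frame, the `feature_cols` list, and the choice to leave the target unscaled are assumptions for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stock data (values are random, not real prices)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "open": rng.uniform(10, 100, size=50),
    "volume": rng.integers(1_000, 1_000_000, size=50).astype(float),
})
toy["close"] = toy["open"] * 1.01

X = toy.drop("close", axis=1)
y = toy["close"]

# 1) Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train = X_train.copy()
X_test = X_test.copy()

# 2) Fit the scaler on the training rows only, then reuse it for the test rows
feature_cols = ["open", "volume"]
scaler = StandardScaler()
X_train[feature_cols] = scaler.fit_transform(X_train[feature_cols])
X_test[feature_cols] = scaler.transform(X_test[feature_cols])

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
```

Leaving `y` unscaled keeps predictions in price units. Since the rows are dated, a chronological split (train on earlier dates, test on later ones) may also be worth considering if the goal is forecasting rather than interpolation.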


## Conclusion
In this task, we:
- Filled missing values with column medians
- One-hot encoded the 'symbol' column
- Split the 'date' column into year, month, and day features
- Standardized the numerical features
- Split the dataset into training (80%) and testing (20%) sets

The dataset is now ready for machine learning modeling.
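
The steps listed in this conclusion can also be gathered into one reusable function. The sketch below is a hypothetical helper (`preprocess` is not part of the notebook) that mirrors the notebook's order of operations, including scaling before the split, and is run here on a small made-up frame with the same column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess(df, target="close", test_size=0.2, random_state=42):
    """Hypothetical helper: apply the notebook's steps to a copy of df."""
    df = df.copy()

    # Step 3: fill missing prices with each column's median
    for col in ["open", "high", "low"]:
        df[col] = df[col].fillna(df[col].median())

    # Step 4: one-hot encode the symbol
    df = pd.get_dummies(df, columns=["symbol"])

    # Step 5: split the date into year, month, day and drop the original
    dates = pd.to_datetime(df["date"])
    df["year"] = dates.dt.year
    df["month"] = dates.dt.month
    df["day"] = dates.dt.day
    df = df.drop(columns="date")

    # Step 6: standardize the numeric columns (target included, as in the notebook)
    numeric_cols = ["open", "high", "low", "close", "volume"]
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

    # Step 7: separate features and target, then split
    X = df.drop(columns=target)
    y = df[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


# Made-up data with the same columns as the stock prices file
open_prices = np.array([25.0, 110.0, np.nan, 111.0, 26.0, 112.0, 27.0, 113.0, 28.0, 114.0])
toy = pd.DataFrame({
    "symbol": ["AAL", "AAP"] * 5,
    "date": [f"2014-01-{d:02d}" for d in range(2, 12)],
    "open": open_prices,
    "high": open_prices + 1,
    "low": open_prices - 1,
    "close": [25.5, 110.5, 26.5, 111.5, 26.2, 112.5, 27.5, 113.5, 28.5, 114.5],
    "volume": [1000, 2000, 1500, 2500, 1200, 2200, 1300, 2300, 1400, 2400],
})

X_train, X_test, y_train, y_test = preprocess(toy)
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
```

Wrapping the steps this way makes it easier to rerun the same preprocessing on a refreshed copy of the data.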
