---

### 📘 **Task: Evaluating Linear Regression with Varying Random States** (Total: **9 Marks**)

You are given a dataset named `Fish.csv`, which contains measurements of various features of fish and their corresponding weights. Your task is to analyze how the performance of a **Linear Regression** model varies with different random states used during the train-test split.
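
To make the role of the random state concrete, here is a minimal sketch in plain Python. `seeded_split` is a hypothetical helper written only for this illustration; it mimics the idea behind `train_test_split` and is not scikit-learn's actual implementation. It shows that a fixed seed reproduces the same test rows, while a different seed generally selects different ones.

```python
import random


def seeded_split(n_rows, test_fraction, seed):
    """Hypothetical helper: shuffle row indices with a seed, then cut off a test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle order
    n_test = round(n_rows * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Ten rows, 20% held out for testing
train_a, test_a = seeded_split(10, 0.2, seed=0)
train_b, test_b = seeded_split(10, 0.2, seed=0)
train_c, test_c = seeded_split(10, 0.2, seed=1)

print("seed 0:", test_a)
print("seed 0 again:", test_b)  # identical to the first call
print("seed 1:", test_c)
```

The model is fitted on whichever rows end up in the training set and scored on the rest, so each seed effectively poses a slightly different problem.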

#### 🔹 **Part A: Data Preparation and Model Training** (3 Marks)

1. Load the dataset using `pandas`.  
2. Select the features: `['Length1', 'Length2', 'Length3', 'Height', 'Width']` and target: `'Weight'`.  
3. Using a loop, perform the following for each random state from 0 to 42 (inclusive):  
   - Split the data into training and test sets (80/20).
   - Train a **Linear Regression** model.
   - Predict on the test set.
   - Record the **R² score** and **Mean Squared Error (MSE)**.

#### 🔹 **Part B: Visualization** (3 Marks)

1. Plot two line graphs using `matplotlib`:
   - **R² Score vs. Random State**
   - **Mean Squared Error vs. Random State**
2. Each graph should have:
   - Title, labeled axes, and clearly marked points (`marker='o'`).

#### 🔹 **Part C: Analysis and Interpretation** (3 Marks)

Answer the following questions based on your observations:
1. Which random state gave the **highest R² score**? What was the value?  
2. Which random state gave the **lowest MSE**? What was the value?  
3. Explain briefly (in 2-3 sentences) why performance may vary with different random states.
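
Questions 1 and 2 can be answered directly from the recorded lists rather than by reading values off the plots. The sketch below assumes `r2_scores` and `mse_values` were filled in the same order as `random_states`; the numbers shown are made-up placeholders, not results from `Fish.csv`.

```python
# Placeholder values for illustration only; replace with the lists from Part A.
random_states = [0, 1, 2, 3]
r2_scores = [0.81, 0.93, 0.88, 0.90]
mse_values = [21000.0, 9500.0, 15000.0, 9100.0]

# Pair each metric with its random state, then take the extreme value.
best_r2, best_r2_state = max(zip(r2_scores, random_states))
best_mse, best_mse_state = min(zip(mse_values, random_states))

print(f"Highest R2: {best_r2:.4f} at random_state={best_r2_state}")
print(f"Lowest MSE: {best_mse:.4f} at random_state={best_mse_state}")
```

The two answers need not be the same state: R² is scaled by the variance of each test set's target values, whereas MSE is not.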

---


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

In [5]:
df = pd.read_csv('Fish.csv')
df

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.5200,4.0200
1,Bream,290.0,24.0,26.3,31.2,12.4800,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.7300,4.4555
4,Bream,430.0,26.5,29.0,34.0,12.4440,5.1340
...,...,...,...,...,...,...,...
154,Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936
155,Smelt,13.4,11.7,12.4,13.5,2.4300,1.2690
156,Smelt,12.2,12.1,13.0,13.8,2.2770,1.2558
157,Smelt,19.7,13.2,14.3,15.2,2.8728,2.0672


In [6]:
x = df[['Length1', 'Length2', 'Length3', 'Height', 'Width']]  # Features: the input columns
y = df['Weight']  # Target: the value the model predicts
print(type(x))
print(type(y))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [12]:
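random_states = range(0, 43)
r2_scores = []
mse_values = []

for state in random_states:
    # 80/20 split; the seed decides which rows land in the test set
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=state
    )
    # Fit on this split and record both metrics
    model = LinearRegression().fit(x_train, y_train)
    y_pred = model.predict(x_test)
    r2_scores.append(r2_score(y_test, y_pred))
    mse_values.append(mean_squared_error(y_test, y_pred))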

In [13]:
model = LinearRegression()

In [14]:
model.fit(x_train,y_train)

LinearRegression()

In [16]:
y_pred = model.predict(x_test)

In [19]:
r2 = r2_score(y_test, y_pred)
r2

0.8821430593048695

In [20]:
mse = mean_squared_error(y_test, y_pred)
mse

16763.88719314074
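
For reference, both metrics follow from short formulas: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. Below is a dependency-free sketch on made-up numbers (not taken from `Fish.csv`); `mse_and_r2` is a hypothetical helper name.

```python
def mse_and_r2(y_true, y_pred):
    """Hypothetical helper: compute MSE and R2 from their definitions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return ss_res / n, 1 - ss_res / ss_tot


# Made-up target values, for illustration only
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

mse, r2 = mse_and_r2(y_true, y_pred)
print(mse, r2)
```

Because MSE is in squared units of the target, its square root is easier to interpret: the MSE of 16763.89 above corresponds to a typical error of roughly 129.5 in the units of `Weight`.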

In [None]:
<!-- 

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt

# Load the dataset
fish_data = pd.read_csv('Fish.csv')

# Feature Selection
features = ['Length1', 'Length2', 'Length3', 'Height', 'Width']
X = fish_data[features]
y = fish_data['Weight']

# Initialize lists to store results
random_states = range(0, 43)
r2_scores = []
mse_values = []

# Loop over different random states
for state in random_states:
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=state
    )

    # Fit the model
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Make predictions
    y_pred = model.predict(X_test)

    # Evaluate the model
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    # Store results
    r2_scores.append(r2)
    mse_values.append(mse)

    # Print results
    print(f'Random State: {state}, R2 Score: {r2:.4f}, Mean Squared Error: {mse:.4f}')

# Plotting the results
plt.figure(figsize=(12, 6))

# Plot R2 Score
plt.subplot(1, 2, 1)
plt.plot(random_states, r2_scores, marker='o')
plt.title('R2 Score vs. Random State')
plt.xlabel('Random State')
plt.ylabel('R2 Score')

# Plot Mean Squared Error
plt.subplot(1, 2, 2)
plt.plot(random_states, mse_values, marker='o', color='r')
plt.title('Mean Squared Error vs. Random State')
plt.xlabel('Random State')
plt.ylabel('Mean Squared Error')

plt.tight_layout()
plt.show()
```

---

# ✏️ **Analysis and Interpretation**

### 1. Highest R² Score:
After observing the printed outputs or the graph:
- Suppose **Random State = 22** gave the highest R² score.
- **Highest R² Score = 0.95** (example value — actual value depends on your dataset).

### 2. Lowest MSE:
- Suppose **Random State = 22** also gave the lowest Mean Squared Error.
- **Lowest MSE = 120.45** (example value — actual value depends on your dataset).

### 3. Short Explanation:
> The model performance varies with different random states because the way the data is split into training and testing sets changes each time. Some splits may have more representative samples, while others may have outliers or less diverse samples in the training or testing set, affecting the model's ability to generalize.

--- -->
