# Machine Learning Concepts, Terminologies, Problem Framing & Your First ML Problem

### What is Machine Learning?
**Definition:** Machine learning (ML) lets computers learn patterns from data instead of following explicitly programmed rules.  
**Why ML?** It automates pattern recognition, which in turn supports decision-making.  
**Real-world examples:** Spam detection, loan prediction, image recognition, recommendation systems.

### Types of ML

| Type | Description | Examples |
|------|--------------|-----------|
| **Supervised Learning** | Model learns from labeled data (input → output known) | House price prediction, spam detection |
| **Unsupervised Learning** | Model learns patterns from unlabeled data | Customer segmentation, anomaly detection |
| **Reinforcement Learning** | Model learns by interacting with environment & receiving rewards | Game AI, self-driving cars |
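
To make the first two rows of the table concrete, here is a minimal sketch using scikit-learn on the Iris dataset. The choice of `KMeans` and `n_clusters=3` is an assumption made for illustration; reinforcement learning is left out because it needs an environment to interact with.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the model is given both the features X and the known labels y.
clf = LogisticRegression(max_iter=200)
clf.fit(X, y)
supervised_pred = clf.predict(X[:5])

# Unsupervised: the model is given only X and has to find structure itself.
# n_clusters=3 is an assumption for this sketch, not something the data tells us.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = km.fit_predict(X)

print("Predicted classes:", supervised_pred)
print("Cluster ids (numbering is arbitrary):", cluster_ids[:5])
```

Note that the cluster ids are not class labels: cluster `0` need not correspond to class `0`.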

### Key Terminologies

| Term | Meaning |
|------|----------|
| **Feature** | Input variable (e.g., age, salary) |
| **Label/Target** | Output variable to predict (e.g., “loan approved”) |
| **Dataset** | Collection of samples (rows) and features (columns) |
| **Training Set** | Used to train the model |
| **Test Set** | Used to evaluate model performance |
| **Model** | Mathematical representation that maps input → output |
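| **Overfitting** | Model memorizes training data, including its noise → performs poorly on new data |
| **Underfitting** | Model too simple → fails to capture patterns, even on the training data |
| **Bias** | Error from overly simple assumptions made by the model; high bias tends to cause underfitting |
| **Variance** | Error from the model’s sensitivity to small changes in the training data; high variance tends to cause overfitting |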
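
One rough way to see underfitting and overfitting is to compare training and test accuracy for a very simple and a very flexible model. The sketch below uses decision trees of depth 1 and unlimited depth on Iris; both settings are assumptions chosen for illustration, and on a dataset this clean the overfitting gap may be small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scores = {}
for depth in (1, None):  # 1 = very simple tree, None = grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    scores[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, test={test_acc:.2f}")
```

Low accuracy on both sets suggests underfitting (high bias); a training score well above the test score suggests overfitting (high variance).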


### ML Workflow
1. Define the problem  
2. Collect data  
3. Clean and preprocess  
4. Split data (train/test)  
5. Train the model  
6. Evaluate performance  
7. Tune hyperparameters  
8. Deploy and monitor  
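
Steps 3 to 7 can be sketched in a few lines of scikit-learn. This is only one possible arrangement: the `StandardScaler` step, the grid of `C` values, and 5-fold cross-validation are assumptions made for illustration, and Iris stands in for data you would collect yourself.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: problem and data (Iris is a stand-in for real collected data).
X, y = load_iris(return_X_y=True)

# Step 4: hold out a test set before anything is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3 and 5: preprocessing and model bundled together, so the scaler
# is fitted on training data only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])

# Step 7: tune the regularisation strength C with cross-validation
# on the training set (the grid values are illustrative).
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 6: evaluate the tuned model once on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print("Best params:", search.best_params_)
print("Test accuracy:", test_accuracy)
```

Tuning on the training folds and touching the test set only once keeps the final score an honest estimate of performance on unseen data.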

## Problem Framing in ML

### Why Framing Matters
ML problems start as **questions**, not algorithms.  
Poorly defined problems → wasted time, misleading results.


### Steps to Frame a Problem

| Step | Description | Example |
|------|--------------|----------|
| 1. **Business Question** | What do you want to achieve? | “Can we predict customer churn?” |
| 2. **Translate to ML Problem** | Define it as a specific ML task (e.g., classification or regression) | “Predict churn (Yes/No) based on customer data.” |
| 3. **Identify Data** | What data is available? What’s missing? | “Transaction history, service usage, feedback scores.” |
| 4. **Define Output & Metrics** | Regression or classification? What metric matters? | Accuracy or F1-score (classification); RMSE (regression) |
| 5. **Decide Success Criteria** | When is your model “good enough”? | “>85% accuracy on unseen data.” |
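
Steps 4 and 5 can be checked in code. The sketch below computes the three metrics from the table on tiny made-up arrays (the numbers are invented purely for illustration) and then tests the example success criterion of more than 85% accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Classification: made-up churn labels (1 = churned, 0 = stayed).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
accuracy = accuracy_score(y_true, y_pred)   # 5 of 6 predictions are correct
f1 = f1_score(y_true, y_pred)               # balances precision and recall

# Regression: made-up numeric targets.
actual = [3.0, 5.0, 2.0]
predicted = [2.5, 5.0, 3.0]
rmse = float(np.sqrt(mean_squared_error(actual, predicted)))

# Success criterion from the table: more than 85% accuracy.
meets_target = bool(accuracy > 0.85)

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}, rmse={rmse:.3f}")
print("Meets success criterion:", meets_target)
```

Here accuracy is 5/6 ≈ 0.833, so this toy model would fall just short of the 85% target.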

### Example – Framing a Problem

- **Business Question:** Can we predict if a patient has diabetes?  
- **ML Problem:** Binary classification (Yes/No).  
- **Input:** Glucose level, BMI, age, etc.  
- **Output:** Diabetes status.  
- **Metric:** Accuracy, Precision, Recall, F1.  
- **Value:** Early prediction → preventive action.  
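
The example lists precision, recall, and F1 next to accuracy. One reason is that in medical problems the positive class is often rare. The sketch below uses synthetic data from `make_classification` (not real patient records; the 90/10 class split is an assumption) and a baseline that always predicts the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: roughly 90% class 0 ("no diabetes"), 10% class 1 ("diabetes").
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, y_pred)
baseline_recall = recall_score(y_test, y_pred)  # share of true positives found

print(f"accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}")
```

The baseline scores high on accuracy yet finds none of the positive cases, which is why recall and F1 belong in the metric list for a problem like this.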

- Chess (black/white): the target is categorical → classification.
- Salary (e.g., 23, 86, 45): the target is a continuous number → regression.
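
The same distinction can be checked programmatically. This sketch uses scikit-learn's `type_of_target` helper; the variable names and values are made up for illustration.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, invented for this sketch.
piece_colour = ["black", "white", "white", "black"]  # categorical
salary = [23.5, 86.0, 45.2]                          # continuous figures

colour_kind = type_of_target(piece_colour)
salary_kind = type_of_target(salary)

print(colour_kind)   # 'binary'     -> classification
print(salary_kind)   # 'continuous' -> regression
```

Treat the result as a hint, not a rule: whole-number values such as 23, 86, 45 are reported as `'multiclass'`, so the final call still depends on what the numbers mean.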



---

## Your First ML Problem

---

###  Hands-on Steps

#### 1. Import libraries
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```

#### 2. Load data
```python
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()
```

#### 3. Split data
```python
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
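
A variation worth knowing: passing `stratify=y` keeps the class proportions the same in both splits. This is not used in the step above; the sketch below only shows the effect on Iris, which has 50 samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the 1:1:1 class ratio in both the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

test_counts = np.bincount(y_test)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(test_counts)                  # [10 10 10]
```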

#### 4. Train model
```python
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
```

#### 5. Evaluate
```python
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```

#### 6. Interpret results
- Accuracy = the fraction of test-set predictions that were correct.  
- Check misclassified examples.
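
One way to check misclassified examples is a confusion matrix plus a list of the rows the model got wrong. The sketch below repeats the steps above with NumPy arrays so that it runs on its own; with this particular split the list may well be empty.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows = true class, columns = predicted class; off-diagonal cells are errors.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Positions in the test set where the prediction differs from the truth.
wrong = np.flatnonzero(y_pred != y_test)
for i in wrong:
    print(
        X_test[i],
        "true:", data.target_names[y_test[i]],
        "predicted:", data.target_names[y_pred[i]],
    )
print("Misclassified:", len(wrong), "of", len(y_test))
```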

#### 7. Discussion
- What if we add more features?  
- What if data is imbalanced?  
- Try a different algorithm (e.g., Decision Tree).  
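
For the last point, swapping the algorithm only changes one line. A minimal sketch comparing the two models on the same split; `max_depth=3` is an assumption chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Both estimators share the same fit/predict interface.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=200),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, estimator.predict(X_test))
    print(f"{name}: {results[name]:.3f}")
```

With only 30 test samples, small differences between the two scores should not be over-interpreted.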


In [2]:
!pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp310-cp310-win_amd64.whl.metadata (11 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Using cached scipy-1.15.3-cp310-cp310-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.2-cp310-cp310-win_amd64.whl (8.9 MB)

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [6]:

from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [8]:
x = df.drop('target', axis=1)
x

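| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |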


In [9]:

y = df['target']
# Note: this run holds out 30% for testing (test_size=0.3); the walkthrough above uses 0.2.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)


In [11]:
X_train

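| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 81 | 5.5 | 2.4 | 3.7 | 1.0 |
| 133 | 6.3 | 2.8 | 5.1 | 1.5 |
| 137 | 6.4 | 3.1 | 5.5 | 1.8 |
| 75 | 6.6 | 3.0 | 4.4 | 1.4 |
| 109 | 7.2 | 3.6 | 6.1 | 2.5 |
| ... | ... | ... | ... | ... |
| 71 | 6.1 | 2.8 | 4.0 | 1.3 |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 |
| 14 | 5.8 | 4.0 | 1.2 | 0.2 |
| 92 | 5.8 | 2.6 | 4.0 | 1.2 |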


In [12]:

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)



| Parameter | Value |
|-----------|-------|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 200 |


In [13]:
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 1.0
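
A score of 1.0 here comes from a single split with only 45 test samples (30% of 150), so it is a rough estimate at best. Cross-validation averages over several splits and usually gives a steadier picture. A minimal sketch, assuming 5 folds:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 5 folds takes a turn as the test set; the rest is used for training.
cv_scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)

print("Fold accuracies:", cv_scores)
print(f"Mean: {cv_scores.mean():.3f}, std: {cv_scores.std():.3f}")
```

If the fold scores vary noticeably, a single perfect split says more about that split than about the model.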
