### Machine Learning for Data Quality Prediction
**Description**: Use a machine learning model to predict data quality issues.

**Steps**:
1. Create a mock dataset with features and a binary label, `quality_issue` (0 = good, 1 = issue).
2. Train a machine learning model.
3. Evaluate the model performance.

In [1]:
# 1. Create a mock dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
n_samples = 1000

# Features that might indicate data quality issues
df = pd.DataFrame({
    'missing_values_count': np.random.randint(0, 10, n_samples),
    'outlier_score': np.random.rand(n_samples),
    'data_type_consistency': np.random.choice([0, 1], n_samples, p=[0.9, 0.1]),  # flag: 1 = inconsistent data type (about 10% of rows)
    'range_violation_count': np.random.randint(0, 5, n_samples),
    'unique_value_ratio': np.random.uniform(0.5, 1.0, n_samples),
    'format_consistency': np.random.choice([0, 1], n_samples, p=[0.95, 0.05])  # flag: 1 = inconsistent format (about 5% of rows)
})

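# Derive the label with a deterministic rule: a row is flagged (1) if ANY of
# these holds: more than 5 missing values, an outlier score above 0.8, a data
# type inconsistency, more than 3 range violations, or a format inconsistency.
# 'unique_value_ratio' does not enter the rule, so it acts as a noise feature.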
df['quality_issue'] = (
    (df['missing_values_count'] > 5) |
    (df['outlier_score'] > 0.8) |
    (df['data_type_consistency'] == 1) |
    (df['range_violation_count'] > 3) |
    (df['format_consistency'] == 1)
).astype(int)

print("Mock Dataset:")
print(df.head())
print("\nDistribution of Data Quality Labels:")
print(df['quality_issue'].value_counts())

# 2. Train a machine learning model
# Separate features (X) and label (y)
X = df.drop('quality_issue', axis=1)
y = df['quality_issue']

# Split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train a Random Forest Classifier model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

print("\nRandom Forest Classifier model trained.")

# 3. Evaluate the model performance
# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("\nModel Evaluation:")
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(report)
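
Because the label in the cell above is a deterministic rule over the features, a high accuracy is expected, and the number is easier to interpret next to a majority-class baseline and the model's feature importances. The following is a minimal sketch of that check, not part of the original notebook. The helper name `compare_to_baseline` and the small arrays `X_demo` and `y_demo` are hypothetical, and the split uses `stratify=y`, which the cell above does not.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_to_baseline(X, y, feature_names, seed=42):
    """Hypothetical helper: fit a majority-class baseline and a random forest
    on the same stratified split; return both accuracies and the importances."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Baseline that always predicts the most frequent training label
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    forest = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    return {
        "baseline_accuracy": accuracy_score(y_test, baseline.predict(X_test)),
        "forest_accuracy": accuracy_score(y_test, forest.predict(X_test)),
        "importances": dict(zip(feature_names, forest.feature_importances_)),
    }


# Small synthetic example (assumed data, not the notebook's DataFrame)
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(0, 10, 500),    # stands in for missing_values_count
    rng.uniform(0.5, 1.0, 500),  # stands in for unique_value_ratio (not used by the label)
])
y_demo = (X_demo[:, 0] > 5).astype(int)  # same style of threshold rule as the notebook

result = compare_to_baseline(
    X_demo, y_demo, ["missing_values_count", "unique_value_ratio"]
)
print(result)
```

With the notebook's own data, the same helper could be called as `compare_to_baseline(X, y, list(X.columns))`. A feature that the labeling rule ignores, such as `unique_value_ratio`, would be expected to receive a comparatively low importance.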

Mock Dataset:
   missing_values_count  outlier_score  data_type_consistency  \
0                     6       0.256016                      0   
1                     3       0.726096                      0   
2                     7       0.592963                      0   
3                     4       0.102213                      0   
4                     6       0.918751                      0   

   range_violation_count  unique_value_ratio  format_consistency  \
0                      4            0.570612                   0   
1                      0            0.785439                   0   
2                      1            0.592635                   0   
3                      2            0.639322                   0   
4                      1            0.609355                   0   

   quality_issue  
0              1  
1              0  
2              1  
3              0  
4              1  

Distribution of Data Quality Labels:
quality_issue
1    672
0    328
Na