<div class="alert alert-block alert-success">
    <h1 align="center">Scikit-Learn Tips</h1>
    <h3 align="center">Tip 19: Synthetic Data - Part 1</h3>
    <h4 align="center"><a href="http://www.iran-machinelearning.ir">Soheil Tehranipour</a></h4>
</div>

Generate a random n-class classification problem.

Imagine you have just learned about a new classification algorithm and want to explore it further. Maybe you’d like to vary its hyperparameters to see how they affect performance.

The only problem: you can’t find a good dataset to experiment with.

Don’t fret. Scikit-Learn provides a function for exactly this situation.

You can use make_classification() to create a variety of classification datasets. Here are a few possibilities:

* Generate binary or multiclass labels.
* Create labels with balanced or imbalanced classes.

Let’s create a few such datasets. We’ll also build RandomForestClassifier models to classify a few of them.

In [10]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, # 1000 observations 
    n_features=5, # 5 total features
    n_informative=3, # 3 'useful' features
    n_classes=2, # binary target/label 
    random_state=85 # if you want the same results as mine
)

Here are the basic input parameters for the function make_classification():

* n_samples: How many observations do you want to generate?
* n_features: The total number of numerical features (informative plus any redundant, repeated, or noise features).
* n_informative: The number of features that are ‘useful.’ Only these features carry the signal that your model will use to classify the dataset.
* n_classes: The number of unique classes (values) for the target label.
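
To see how these parameters fit together, here is a minimal sketch. It assumes two arguments not discussed above: `n_redundant` (which defaults to 2, so the two non-informative features here are linear combinations of the informative ones) and `shuffle=False` (which keeps the informative columns first). The names `X_demo` and `y_demo` and the seed are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification

# Assumption: small demo dataset, separate from the tutorial's X and y
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=5,     # total columns
    n_informative=3,  # columns that carry the signal
    n_redundant=2,    # linear combinations of the informative columns (the default)
    n_repeated=0,
    n_classes=2,
    shuffle=False,    # keep column order: informative first, then redundant
    random_state=0,   # arbitrary seed, for repeatability only
)

print(X_demo.shape)                 # (n_samples, n_features)
print(sorted(set(y_demo.tolist()))) # the distinct class labels

# Because the redundant columns are built from the informative ones,
# the feature matrix should have rank equal to n_informative.
rank = np.linalg.matrix_rank(X_demo)
print(rank)
```

If the rank matches `n_informative`, that confirms the extra columns add no new information, which is why only the informative features matter for the classifier.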


In [11]:
import pandas as pd
dataset = pd.DataFrame(X)
dataset.columns = ['X1', 'X2', 'X3', 'X4', 'X5']
dataset['y'] = y
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      1000 non-null   float64
 1   X2      1000 non-null   float64
 2   X3      1000 non-null   float64
 3   X4      1000 non-null   float64
 4   X5      1000 non-null   float64
 5   y       1000 non-null   int32  
dtypes: float64(5), int32(1)
memory usage: 43.1 KB


In [12]:
dataset['y'].value_counts()

1    508
0    492
Name: y, dtype: int64

In [13]:
dataset.head()

         X1         X2         X3         X4         X5  y
0  -0.77135  -1.301386   1.535153  -0.931661   1.395875  0
1  0.149792   1.397874  -1.917018   0.423957  -1.596891  1
2  1.256057   0.870594  -0.624548   1.268236  -0.629394  0
3  0.962549   2.049226  -0.902791   1.488852   0.667549  0
4  0.000505  -1.131726   2.040444  -0.175372   2.101964  0


In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# initialize classifier
classifier = RandomForestClassifier() 

# Run cross validation with 10 folds
scores = cross_validate(
    classifier, X, y, cv=10, 
    # measure score for a list of classification metrics
    scoring=['accuracy', 'precision', 'recall', 'f1']
)

scores = pd.DataFrame(scores)
scores.mean().round(4)

fit_time          0.2841
score_time        0.0124
test_accuracy     0.8230
test_precision    0.8186
test_recall       0.8385
test_f1           0.8275
dtype: float64
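
The introduction mentioned trying out hyperparameters. One possible way to do that on this synthetic dataset is to loop over a few values and compare the cross-validated scores. The sketch below is only an illustration: the candidate `n_estimators` values, the 5-fold setting, the classifier seed, and the names `X_hp`, `y_hp`, and `results` are all assumptions, not part of the original tip.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Recreate the balanced binary dataset used above
X_hp, y_hp = make_classification(
    n_samples=1000, n_features=5, n_informative=3,
    n_classes=2, random_state=85,
)

results = {}
for n_trees in (10, 50):  # arbitrary candidate values
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    cv_scores = cross_validate(
        clf, X_hp, y_hp, cv=5, scoring=['accuracy', 'f1']
    )
    # store the mean accuracy across folds for this setting
    results[n_trees] = cv_scores['test_accuracy'].mean()
    print(n_trees, round(results[n_trees], 4))
```

Because the data is generated with a fixed `random_state`, differences between runs come from the classifier settings rather than from the dataset.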

## Imbalanced Dataset

In [15]:
X, y = make_classification(
    # the usual parameters
    n_samples=1000, n_features=5, n_informative=3, n_classes=2, 
    # assign about 97% of observations to label 0 and the remaining 3% to label 1
    weights=[0.97], 
)

In [16]:
pd.DataFrame(y).value_counts()

0    964
1     36
dtype: int64
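
The counts above are 964 and 36 rather than exactly 970 and 30. A likely reason is the `flip_y` parameter of make_classification(), which defaults to 0.01 and randomly reassigns the class of roughly that fraction of samples. The sketch below sets `flip_y=0` to turn that label noise off. The seed and the names `X_imb`, `y_imb`, and `counts` are illustrative assumptions.

```python
from collections import Counter
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    weights=[0.97],  # class 0 gets ~97%; class 1 gets the remainder
    flip_y=0,        # disable random label reassignment (default is 0.01)
    random_state=0,  # arbitrary seed, for repeatability only
)

# count how many observations ended up in each class
counts = Counter(y_imb.tolist())
print(counts)
```

With the label noise disabled, the class proportions should follow `weights` much more closely.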

## Multiclass Dataset

In [17]:
X, y = make_classification(
    # same parameters as usual 
    n_samples=1000, n_features=5, n_informative=3,
    # create target label with 3 classes
    n_classes=3, 
)

In [18]:
pd.DataFrame(y).value_counts()

1    334
0    333
2    333
dtype: int64
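
One thing to watch when increasing n_classes: make_classification() requires that n_classes times `n_clusters_per_class` (2 by default) does not exceed 2 to the power of n_informative. With n_informative=3 that allows at most 4 classes at the default setting. The sketch below shows the error and one way around it. The specific values, the seed, and the names `error_message`, `X_ok`, and `y_ok` are illustrative assumptions.

```python
from sklearn.datasets import make_classification

# 5 classes * 2 clusters per class = 10, which exceeds 2**3 = 8
try:
    make_classification(
        n_samples=1000, n_features=5, n_informative=3, n_classes=5
    )
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Raising n_informative to 4 satisfies the constraint (10 <= 2**4 = 16).
# n_redundant is lowered to 1 so informative + redundant still fits in 5 features.
X_ok, y_ok = make_classification(
    n_samples=1000, n_features=5, n_informative=4, n_redundant=1,
    n_classes=5, random_state=0,
)
print(X_ok.shape, sorted(set(y_ok.tolist())))
```

Lowering `n_clusters_per_class` to 1 would be another way to fit more classes into the same number of informative features.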

<img src="https://webna.ir/wp-content/uploads/2018/08/%D9%85%DA%A9%D8%AA%D8%A8-%D8%AE%D9%88%D9%86%D9%87.png" width=50% />