![image-2.png](attachment:image-2.png)
<center><h1>Data Collection and Sampling</h1></center>

1. **Simple Random Sampling:**
   - **Step 1:** Define the Population - Clearly identify the entire group that you want to draw conclusions about.
   - **Step 2:** List the Population - Create a list of all individuals or elements in the population.
   - **Step 3:** Assign Numbers - Assign a unique number to each individual or element on the list.
   - **Step 4:** Use a Random Number Generator - Generate random numbers and select the individuals or elements corresponding to those numbers for your sample.
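
The four steps above can be sketched end to end as follows. This is only an illustration: the `population` table, its `name` and `number` columns, and the seed value are assumptions made for the example, not part of the original material. A seeded `np.random.default_rng` is used so that the same sample is drawn on every run.

```python
import numpy as np
import pandas as pd

# Steps 1-2: a hypothetical population, listed in full.
population = pd.DataFrame(
    {"name": ["Ana", "Ben", "Cho", "Dev", "Eli", "Fay", "Gus", "Hoa"]}
)

# Step 3: assign a unique number (1..N) to every member of the list.
population["number"] = np.arange(1, len(population) + 1)

# Step 4: draw numbers without replacement from a seeded generator,
# so the same sample is reproduced on every run.
rng = np.random.default_rng(seed=42)
drawn_numbers = rng.choice(population["number"].to_numpy(), size=3, replace=False)

# Keep the members whose assigned number was drawn.
simple_random_sample = population[population["number"].isin(drawn_numbers)]
print(simple_random_sample)
```

Drawing with `replace=False` guarantees that no member can be selected twice.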


In [7]:
import pandas as pd
import numpy as np

datasets = {"feature01": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "feature02": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
            "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}
datasets_dataframe = pd.DataFrame(datasets)
print(f"Original Dataset:\n{datasets_dataframe.head(10)}\n")

# Draw 5 distinct row labels; replace=False prevents duplicates.
sample_size = 5
random_sample_select = np.random.choice(datasets_dataframe.index,
                                        size=sample_size,
                                        replace=False)
# Look up the selected rows by label.
simple_random_sampling = datasets_dataframe.loc[random_sample_select]
print(f"Simple Random Sampling: \n{simple_random_sampling}")

Original Dataset:
   feature01 feature02  target
0          1         A       0
1          2         A       1
2          3         B       0
3          4         B       1
4          5         C       0
5          6         C       1
6          7         C       0
7          8         D       1
8          9         D       0
9         10         D       1

Simple Random Sampling: 
   feature01 feature02  target
9         10         D       1
0          1         A       0
4          5         C       0
2          3         B       0
5          6         C       1


2. **Stratified Random Sampling:**
   - **Step 1:** Identify Strata - Divide the population into distinct subgroups or strata based on certain characteristics.
   - **Step 2:** Determine Proportions - Determine the proportion of individuals or elements in each stratum relative to the total population.
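   - **Step 3:** Randomly Select Within Strata - Use simple random sampling within each stratum, drawing from each stratum in proportion to its share of the population (or taking an equal number from each stratum when equal allocation is preferred).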
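
Steps 2 and 3 can be sketched with proportional allocation, where each stratum contributes rows according to its size. The `sampling_fraction` value and the choice to round up with `math.ceil` are assumptions for this illustration; other rounding rules are equally valid.

```python
import math
import pandas as pd

df = pd.DataFrame({
    "feature01": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "feature02": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Assumed for illustration: sample half of every stratum.
sampling_fraction = 0.5

parts = []
for stratum, group in df.groupby("feature02"):
    # Step 2: the stratum's share of the sample follows its size.
    n_stratum = math.ceil(len(group) * sampling_fraction)
    # Step 3: simple random sampling inside the stratum.
    parts.append(group.sample(n=n_stratum, random_state=42))

proportional_sample = pd.concat(parts)
print(proportional_sample["feature02"].value_counts().sort_index())
```

With strata of sizes 2, 2, 3 and 3, this draws 1, 1, 2 and 2 rows, so the larger strata stay more heavily represented in the sample.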



In [13]:
import pandas as pd

datasets = {"feature01": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "feature02": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
            "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}

datasets_dataframe = pd.DataFrame(datasets)
print(f"Original Dataset:\n{datasets_dataframe.head(3)}\n")

# Each distinct value of feature02 defines one stratum.
strata = datasets_dataframe["feature02"].unique()
print(f"Strata values: {strata}")

# Equal allocation: draw the same number of rows from every stratum.
sample_size = 2
stratum_samples = []

for stratum in strata:
    stratum_data = datasets_dataframe[datasets_dataframe["feature02"] == stratum]
    stratum_samples.append(stratum_data.sample(n=sample_size, random_state=42))

# Combine the per-stratum samples into one stratified sample.
stratified_sample = pd.concat(stratum_samples)
print(f"Stratified Sampling:\n{stratified_sample}")

Original Dataset:
   feature01 feature02  target
0          1         A       0
1          2         A       1
2          3         B       0

Strata values: ['A' 'B' 'C' 'D']
Stratified Sampling:
   feature01 feature02  target
1          2         A       1
0          1         A       0
3          4         B       1
2          3         B       0
4          5         C       0
5          6         C       1
7          8         D       1
8          9         D       0


3. **Systematic Sampling:**
   - **Step 1:** Define the Population - Clearly identify the entire population.
   - **Step 2:** Determine Sampling Interval - Calculate the sampling interval by dividing the population size by the desired sample size.
   - **Step 3:** Random Start - Choose a random starting point within the first interval.
   - **Step 4:** Select at Regular Intervals - Starting from the random start, select every k-th individual or element, where k is the sampling interval from Step 2, until the sample is complete.
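
As a rough sketch of Steps 2 to 4, the interval can be derived from the population size and the desired sample size rather than fixed by hand. The `desired_sample_size` value and the seed are assumptions chosen for the example; integer division is used, so any leftover rows at the end of the population are simply not reached.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature01": list(range(1, 11))})

population_size = len(df)
desired_sample_size = 5  # assumed for illustration

# Step 2: sampling interval = population size / desired sample size.
sampling_interval = population_size // desired_sample_size

# Step 3: random start inside the first interval (0-based position).
rng = np.random.default_rng(seed=0)
start = int(rng.integers(0, sampling_interval))

# Step 4: take every sampling_interval-th row from the start onward.
positions = np.arange(start, population_size, sampling_interval)[:desired_sample_size]
systematic_sample = df.iloc[positions]
print(systematic_sample)
```

Only the starting point is random here; once it is fixed, every other selected position is determined by the interval.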


In [22]:
import pandas as pd
import numpy as np

datasets = {"feature01": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "feature02": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
            "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}

datasets_dataframe = pd.DataFrame(datasets)
print(f"Original Dataset:\n{datasets_dataframe.head(3)}\n")

# Select every sample_interval-th row.
sample_interval = 2

# Random start within the first interval (1-based: 1..sample_interval).
random_starting_point = np.random.randint(1, sample_interval + 1)

# Convert the start to a 0-based position, then step by the interval.
systematic_sampling_indices = np.arange(random_starting_point - 1,
                                        len(datasets_dataframe),
                                        sample_interval)
systematic_sampling = datasets_dataframe.loc[systematic_sampling_indices]
print(f"Systematic Sampling:\n{systematic_sampling}")

Original Dataset:
   feature01 feature02  target
0          1         A       0
1          2         A       1
2          3         B       0

Systematic Sampling:
   feature01 feature02  target
1          2         A       1
3          4         B       1
5          6         C       1
7          8         D       1
9         10         D       1



4. **Cluster Sampling:**
   - **Step 1:** Define the Population - Clearly identify the entire population.
   - **Step 2:** Divide into Clusters - Divide the population into clusters, often based on geographical regions.
   - **Step 3:** Randomly Select Clusters - Randomly select a few clusters from the population.
   - **Step 4:** Include all Members - Include all individuals or elements within the selected clusters in your sample.
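
A small sketch of Steps 2 to 4 with geographical clusters is shown below. The `household` and `region` columns, the region names, and the seed are all hypothetical values introduced for the example.

```python
import numpy as np
import pandas as pd

# Step 2: a hypothetical population already divided into regional clusters.
df = pd.DataFrame({
    "household": list(range(1, 11)),
    "region": ["North"] * 2 + ["South"] * 2 + ["East"] * 3 + ["West"] * 3,
})

# Step 3: randomly pick two whole clusters.
rng = np.random.default_rng(seed=7)
clusters = df["region"].unique()
chosen_clusters = rng.choice(clusters, size=2, replace=False)

# Step 4: keep every member of the chosen clusters.
cluster_sample = df[df["region"].isin(chosen_clusters)]
print(f"Chosen clusters: {list(chosen_clusters)}, sample size: {len(cluster_sample)}")
```

Because whole clusters are kept, the final sample size is not fixed in advance: with these cluster sizes it can be 4, 5 or 6 rows depending on which two regions are drawn.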


In [25]:
import pandas as pd
import numpy as np

datasets = {"feature01": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "feature02": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
            "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}

datasets_dataframe = pd.DataFrame(datasets)
print(f"Original Dataset:\n{datasets_dataframe.head(3)}\n")

# Treat each distinct value of feature02 as one cluster and pick two at random.
number_of_cluster = 2
select_cluster_data = np.random.choice(datasets_dataframe["feature02"].unique(),
                                       size=number_of_cluster,
                                       replace=False)
# Keep every row that belongs to one of the selected clusters.
cluster_sample = datasets_dataframe[datasets_dataframe["feature02"].isin(select_cluster_data)]
print(f"Cluster Sample \n{cluster_sample}")

Original Dataset:
   feature01 feature02  target
0          1         A       0
1          2         A       1
2          3         B       0

Cluster Sample 
   feature01 feature02  target
0          1         A       0
1          2         A       1
7          8         D       1
8          9         D       0
9         10         D       1


5. **Convenience Sampling:**
   - **Step 1:** Identify Accessible Individuals - Choose individuals or elements that are readily available and easy to reach.
   - **Step 2:** Use Available Resources - Utilize resources that are convenient for the researcher, such as locations or existing groups.
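
The two steps can be sketched by keeping only the rows from groups that are easy to reach. The `accessible_groups` list is a hypothetical choice made for this example. Comparing the sample mean with the population mean hints at why a convenience sample may not represent the population well.

```python
import pandas as pd

df = pd.DataFrame({
    "feature01": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "feature02": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
    "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Steps 1-2: assume (hypothetically) that only groups A and B are easy to reach.
accessible_groups = ["A", "B"]
convenience_sample = df[df["feature02"].isin(accessible_groups)]

# Compare the convenience sample against the full population.
sample_mean = convenience_sample["feature01"].mean()
population_mean = df["feature01"].mean()
print(f"Sample mean: {sample_mean}, population mean: {population_mean}")
```

No randomness is involved in the selection, which is what distinguishes this approach from the probability-based methods above.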

In [27]:
import pandas as pd

datasets = {"feature01": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
            "feature02": ["A", "A", "B", "B", "C", "C", "C", "D", "D", "D"],
            "target": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]}

datasets_dataframe = pd.DataFrame(datasets)
print(f"Original Dataset:\n{datasets_dataframe.head(3)}\n")

# Convenience sampling is not random: take whatever is easiest to reach,
# represented here by the first rows of the table.
convenience_sample_size = 5
convenience_sample = datasets_dataframe.head(convenience_sample_size)
print(f"Convenience Sampling:\n{convenience_sample}")

Original Dataset:
   feature01 feature02  target
0          1         A       0
1          2         A       1
2          3         B       0

Convenience Sampling:
   feature01 feature02  target
0          1         A       0
1          2         A       1
2          3         B       0
3          4         B       1
4          5         C       0
