# Random Sampling in Pandas
In this notebook, we'll explore how to perform random sampling in Pandas using the `sample()` method. Random sampling is useful for creating subsets of data for analysis, training/testing, or bootstrapping.

## 1. Basic Random Sampling
Use the `sample()` method to randomly sample a specified number of rows.

In [3]:
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Randomly sample 2 rows
sampled_df = df.sample(n=2)
print(sampled_df)

   A   B
3  4  40
2  3  30
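
A few defaults are worth knowing at this point. The sketch below assumes the same five-row frame as above and illustrates three behaviours: `sample()` with no arguments returns a single row, sampled rows keep their original index labels, and the source DataFrame is left unchanged.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# With neither n nor frac given, sample() draws one row
one_row = df.sample()
print(len(one_row))

# Sampled rows keep the index labels they had in df
sampled_df = df.sample(n=2)
print(sampled_df.index.isin(df.index).all())

# sample() returns a new object; df itself still has all five rows
print(len(df))
```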


## 2. Sampling with a Fraction of the Data
Instead of specifying the number of rows, you can sample a fraction of the data using the `frac` parameter.

In [5]:
# Sample 40% of the rows
sampled_df = df.sample(frac=0.4)
print(sampled_df)

   A   B
2  3  30
3  4  40
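
As a small check of how `frac` maps to a row count, the sketch below rebuilds the five-row frame and confirms that `frac=0.4` yields two rows. It also shows that `n` and `frac` are alternatives: passing both raises a `ValueError`. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# 40% of 5 rows is 2 rows
two_rows = df.sample(frac=0.4)
print(len(two_rows))

# n and frac cannot be combined in one call
both_rejected = False
try:
    df.sample(n=2, frac=0.4)
except ValueError as exc:
    both_rejected = True
    print(f"ValueError: {exc}")
```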


## 3. Random Sampling with Reproducibility
To ensure reproducibility (getting the same random sample every time), set a random seed using the `random_state` parameter.

In [9]:
# Sample with a fixed random seed
sampled_df = df.sample(n=3, random_state=42)
print(sampled_df)

   A   B
1  2  20
4  5  50
2  3  30
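
One way to confirm the reproducibility claim is to draw twice with the same seed and compare the results. This is a minimal sketch on the same five-row frame; the seed values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# Two draws with the same seed select the same rows in the same order
first = df.sample(n=3, random_state=42)
second = df.sample(n=3, random_state=42)
same_seed_matches = first.equals(second)
print(same_seed_matches)

# A different seed is an independent draw and may select other rows
other = df.sample(n=3, random_state=7)
print(other)
```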


## 4. Random Sampling with Replacement
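By default, sampling is done without replacement, so each row can be selected at most once. To allow duplicates, set `replace=True`. This is also required whenever `n` exceeds the number of rows, as in the example here, which draws 6 rows from a 5-row DataFrame.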

In [12]:
# Sample with replacement
sampled_df = df.sample(n=6, replace=True)
print(sampled_df)

   A   B
1  2  20
1  2  20
0  1  10
2  3  30
2  3  30
3  4  40
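
The sketch below contrasts the two modes on the same five-row frame. Asking for six rows without replacement fails with a `ValueError`, while the same request with `replace=True` succeeds and must contain at least one repeated index label, since only five distinct rows exist.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# Without replacement, n cannot exceed the number of rows
oversample_rejected = False
try:
    df.sample(n=6)
except ValueError as exc:
    oversample_rejected = True
    print(f"ValueError: {exc}")

# With replacement, six draws from five rows repeat at least one row
with_dupes = df.sample(n=6, replace=True, random_state=0)
print(with_dupes.index.has_duplicates)
```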


## 5. Random Sampling of Columns
To sample random columns instead of rows, set `axis=1` (or `axis='columns'`). Here `n` counts columns rather than rows.

In [15]:
# Randomly sample 1 column
sampled_columns = df.sample(n=1, axis=1)
print(sampled_columns)

   A
0  1
1  2
2  3
3  4
4  5


## 6. Weighted Sampling
The `weights` parameter assigns each row a relative probability of being selected. Weights that do not sum to 1 are normalized, and a row with weight 0 is never selected (row 4 in the example here).

In [18]:
# Assign weights to rows
sampled_df = df.sample(n=3, weights=[0.1, 0.2, 0.3, 0.4, 0.0], random_state=42)
print(sampled_df)

   A   B
2  3  30
3  4  40
1  2  20
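
A rough way to see the weights at work is to draw many rows with replacement and compare the observed frequencies with the weights. The sketch below assumes the same frame and weights as above; the draw count of 10,000 is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
weights = [0.1, 0.2, 0.3, 0.4, 0.0]

# Many draws with replacement approximate the selection probabilities
draws = df.sample(n=10_000, replace=True, weights=weights, random_state=42)
freq = draws.index.value_counts(normalize=True).sort_index()
print(freq)

# Row 4 has weight 0, so it never shows up
print(4 in draws.index)
```

The printed frequencies should land close to 0.1, 0.2, 0.3 and 0.4 for rows 0 to 3, with row 4 absent.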


## 7. Shuffling Rows
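Sampling all rows (`frac=1`) without replacement returns every row exactly once in random order, which shuffles the DataFrame. The rows keep their original index labels.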

In [21]:
# Shuffle the entire DataFrame
shuffled_df = df.sample(frac=1, random_state=42)
print(shuffled_df)

   A   B
1  2  20
4  5  50
2  3  30
0  1  10
3  4  40
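
Because the shuffled rows keep their original index labels, a common follow-up is to renumber them. A minimal sketch, assuming a fresh 0-based index is wanted after the shuffle:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})

# Shuffle, then discard the old labels and renumber from 0
shuffled_df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled_df)
```

The same rows are present as before; only their order and index labels change.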


# Examples with a Real Dataset

With `replace=False` (the default), no row can appear more than once in the sample. For the same `random_state`, the output is the same on every run.

In [37]:
df = pd.read_csv(r'../Datasets/Property_Crimes.csv')
df.head()

| Unnamed: 0 | Area_Name | Year | Group_Name | Sub_Group_Name | Cases_Property_Recovered | Cases_Property_Stolen | Value_of_Property_Recovered | Value_of_Property_Stolen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Andaman & Nicobar Islands | 2001 | Burglary - Property | 3. Burglary | 27 | 64 | 755858 | 1321961 |
| 1 | Andhra Pradesh | 2001 | Burglary - Property | 3. Burglary | 3321 | 7134 | 51483437 | 147019348 |
| 2 | Arunachal Pradesh | 2001 | Burglary - Property | 3. Burglary | 66 | 248 | 825115 | 4931904 |
| 3 | Assam | 2001 | Burglary - Property | 3. Burglary | 539 | 2423 | 3722850 | 21466955 |
| 4 | Bihar | 2001 | Burglary - Property | 3. Burglary | 367 | 3231 | 2327135 | 17023937 |


In [39]:
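# Number of rows in a 70% sample
nrows = round(df.shape[0] * 0.7)
# Draw nrows distinct rows; random_state makes the draw repeatable
df_sample = df.sample(n=nrows, replace=False, random_state=100)
df_sample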

| Unnamed: 0 | Area_Name | Year | Group_Name | Sub_Group_Name | Cases_Property_Recovered | Cases_Property_Stolen | Value_of_Property_Recovered | Value_of_Property_Stolen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 386 | Andhra Pradesh | 2002 | Criminal Breach of Trust - Property | 5. Criminal Breach of Trust | 104 | 382 | 6462978 | 67143584 |
| 2415 | Andhra Pradesh | 2010 | Total Property | 7. Total Property Stolen & Recovered | 19848 | 36407 | 725079668 | 1533054566 |
| 919 | Delhi | 2007 | Dacoity -Property | 1. Dacoity | 30 | 34 | 1310250 | 6374650 |
| 238 | Rajasthan | 2007 | Burglary - Property | 3. Burglary | 1634 | 4951 | 72297141 | 166591147 |
| 2400 | Manipur | 2009 | Total Property | 7. Total Property Stolen & Recovered | 33 | 698 | 48045570 | 119195331 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 665 | Andaman & Nicobar Islands | 2010 | Criminal Breach of Trust - Property | 5. Criminal Breach of Trust | 2 | 10 | 147644 | 530144 |
| 271 | Puducherry | 2008 | Burglary - Property | 3. Burglary | 50 | 90 | 3810322 | 5486940 |
| 1756 | Chhattisgarh | 2001 | Theft - Property | 4. Theft | 1664 | 4812 | 22767348 | 51329061 |
| 780 | Goa | 2003 | Dacoity -Property | 1. Dacoity | 1 | 4 | 255000 | 1303000 |
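
A 70% sample like this one is often used as the training part of a train/test split, with the unsampled rows held out for testing. The sketch below shows one way to do that by dropping the sampled index labels. It uses a small made-up frame (`toy`, with hypothetical values) as a stand-in, since the CSV file is not available here, and it assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical stand-in for the crime dataset (values are made up)
toy = pd.DataFrame({
    'Area_Name': [f'Area {i}' for i in range(10)],
    'Cases_Property_Stolen': list(range(10, 110, 10)),
})

# 70% of the rows for training, drawn without replacement
train = toy.sample(frac=0.7, random_state=100)

# The remaining 30% are the rows whose labels were not sampled
test = toy.drop(train.index)

print(len(train), len(test))
```

Dropping by index label only works as intended when labels are unique, which holds for the default integer index.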


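## Bootstrap Sampling

A bootstrap sample contains the same number of rows as the original dataset, drawn with replacement, so some rows appear more than once and others not at all. With `df.sample()`, this is `frac=1` combined with `replace=True`.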

In [45]:
# Draw as many rows as df has, with replacement
df.sample(frac=1, replace=True, random_state=100)

| Unnamed: 0 | Area_Name | Year | Group_Name | Sub_Group_Name | Cases_Property_Recovered | Cases_Property_Stolen | Value_of_Property_Recovered | Value_of_Property_Stolen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1544 | Bihar | 2005 | Robbery - Property | 2. Robbery | 439 | 2374 | 17663039 | 65234508 |
| 1859 | Bihar | 2004 | Theft - Property | 4. Theft | 1989 | 11113 | 23897635 | 125439263 |
| 79 | Delhi | 2003 | Burglary - Property | 3. Burglary | 758 | 1898 | 4268081 | 84069640 |
| 1930 | Chandigarh | 2006 | Theft - Property | 4. Theft | 597 | 1234 | 13750425 | 36268500 |
| 350 | Andaman & Nicobar Islands | 2001 | Criminal Breach of Trust - Property | 5. Criminal Breach of Trust | 1 | 10 | 80000 | 1226967 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1198 | Daman & Diu | 2005 | Other heads of Property | 6. Other Property | 2 | 3 | 45000 | 45000 |
| 1752 | Arunachal Pradesh | 2001 | Theft - Property | 4. Theft | 193 | 443 | 6344706 | 12395613 |
| 2231 | Puducherry | 2004 | Total Property | 7. Total Property Stolen & Recovered | 464 | 730 | 7984326 | 14967498 |
| 23 | Mizoram | 2001 | Burglary - Property | 3. Burglary | 188 | 417 | 1595997 | 3249203 |
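
A single bootstrap sample is rarely the goal; the usual pattern is to repeat the resampling many times and look at how a statistic varies. The sketch below is one simple way to do that, not a prescribed method. It uses the five `Cases_Property_Stolen` values from the `df.head()` output as a tiny stand-in, and the resample count of 1000 and the percentile interval are illustrative assumptions.

```python
import pandas as pd

# The five Cases_Property_Stolen values shown by df.head()
values = pd.Series([64, 7134, 248, 2423, 3231], name='Cases_Property_Stolen')

n_resamples = 1000  # illustrative choice

# Each resample has len(values) draws with replacement; the seed
# changes per iteration so the resamples differ but stay repeatable
boot_means = pd.Series([
    values.sample(frac=1, replace=True, random_state=seed).mean()
    for seed in range(n_resamples)
])

# Simple percentile interval for the mean
low, high = boot_means.quantile([0.025, 0.975])
print(f"mean: {values.mean():.1f}")
print(f"95% percentile interval: {low:.1f} to {high:.1f}")
```

With only five observations the interval is very wide; on the full dataset the same loop would run over `df['Cases_Property_Stolen']` instead.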
