# Drug Consumption Classification Analytics

## Sprint 1: Project Initiation

### Project Overview

This project involves analyzing and classifying drug consumption patterns based on demographic, psychological, and behavioral data. The dataset includes features such as age, gender, education, country, ethnicity, personality scores (Nscore, Escore, Oscore, Ascore, Cscore, Impulsiveness, Sensation Seeking (SS)), and information on the use of various drugs over different time periods. The goal is to uncover patterns and predictors in drug use behavior, which can provide valuable insights for public health interventions and policy-making.

### The Problem Area

Drug abuse remains a significant public health crisis, affecting millions of individuals and communities worldwide. It leads to severe health complications, social problems, economic burdens, and legal issues. The complexity of drug abuse necessitates a multi-faceted approach to understanding, predicting, and mitigating its impact. The primary problem area addressed by this project is the need for accurate classification and prediction of drug consumption patterns to better inform prevention and intervention strategies.

### Challenges and opportunities to address:
 
#### Challenges:

#### 1. Data Quality and Availability:
**Challenge:** Obtaining comprehensive and high-quality data on drug use is difficult. There may be issues with missing data, inaccuracies, and inconsistencies due to self-reporting biases and underreporting.\
**Opportunity:** Collaborate with healthcare providers, rehabilitation centers, and government agencies to access and integrate diverse datasets. Use advanced data cleaning and imputation techniques to handle missing or incomplete data.

#### 2. Privacy and Ethical Concerns:
**Challenge:** Handling sensitive data on drug use raises significant privacy and ethical issues. Ensuring the confidentiality and security of individuals' data is paramount.\
**Opportunity:** Implement robust data protection measures and ethical guidelines. Use de-identified data and secure data-sharing agreements to maintain privacy while enabling research.

#### 3. Complexity of Drug Use Behavior:
**Challenge:** Drug use behavior is influenced by a multitude of factors, including psychological, social, economic, and environmental aspects. Modeling such complex behavior accurately is challenging.\
**Opportunity:** Employ advanced machine learning and AI techniques that can handle complex, multi-dimensional data. Use techniques like ensemble learning, deep learning, and feature engineering to capture the intricate relationships between variables.

#### 4. Intervention Effectiveness: ####
**Challenge:** Developing interventions that are both effective and scalable can be difficult. Interventions need to be tailored to diverse populations and settings, which requires a deep understanding of the target groups.\
**Opportunity:** Use data-driven insights to design personalized and targeted interventions. Implement pilot programs to test and refine these interventions before scaling up. Leverage technology for scalable solutions, such as mobile health applications and online support systems.

#### 5. Policy and Regulatory Barriers:
**Challenge:** Navigating the regulatory landscape and ensuring compliance with laws and policies related to drug use and data usage can be restrictive and complex.\
**Opportunity:** Engage with policymakers and regulatory bodies to advocate for data-driven approaches to drug prevention and treatment. Use research findings to inform policy changes that support effective interventions and data sharing for public health purposes.

### Opportunities:

#### 1.Predictive Analytics for Early Intervention:

**Opportunity:** Use predictive modeling to identify individuals at high risk of drug abuse early on. This can enable timely interventions that prevent the escalation of substance use and related harms.\
**Application:** Develop risk assessment tools that healthcare providers can use to screen patients and offer targeted support and resources.

#### 2. Personalized Treatment Plans:
**Opportunity:** Leverage data to create personalized treatment plans based on an individual's specific risk factors, psychological profile, and drug use history.\
**Application:** Implement personalized treatment strategies in rehabilitation centers, enhancing the effectiveness of programs and improving patient outcomes.

#### 3. Enhanced Public Health Surveillance:
**Opportunity:** Improve public health surveillance systems by integrating data from various sources, enabling real-time monitoring of drug use trends and emerging threats.\
**Application:** Use this data to inform public health campaigns, allocate resources more effectively, and respond rapidly to outbreaks of substance abuse.

#### 4. Community-Based Programs:
**Opportunity:** Design community-based programs that are informed by local data, addressing the specific needs and challenges of different communities.\
**Application:** Work with community organizations to implement evidence-based prevention and support programs, increasing their relevance and impact.

#### 5. Educational and Awareness Campaigns:
**Opportunity:** Use insights from data analysis to develop targeted educational and awareness campaigns that address misconceptions and provide accurate information about drug use and its risks.\
**Application:** Create tailored messages for different demographic groups, utilizing social media, schools, workplaces, and other platforms to reach a wide audience.

By addressing these challenges and leveraging the opportunities, the project can make significant strides in understanding and mitigating drug abuse, ultimately leading to better health outcomes and more effective public health strategies.



### The User

The Drug Consumption Classification project has the potential to significantly impact various stakeholders by providing valuable insights and tools to combat drug abuse. Through predictive modeling, targeted interventions, and data-driven decision-making, the project aims to improve public health outcomes and enhance the effectiveness of prevention and treatment programs.

**Primary Users:**
1) Public Health Officials and Policymakers
2) Healthcare Providers and Practitioners
3) Rehabilitation Centers and Counselors
4) Researchers and Academics
5) Non-Governmental Organizations (NGOs) and Community Groups

**Secondary Beneficiaries:**
1) Individuals and Families
2) Society at Large
3) Employers and Workplaces

### The Big Idea
Machine learning offers robust methodologies to address the identified challenges:

The Drug Consumption Classification project leverages data science to tackle the pervasive issue of drug abuse by identifying patterns and predictors of drug use based on demographic, psychological, and behavioral data. By developing predictive models and data-driven insights, the project aims to enable early identification of at-risk individuals, personalize intervention and treatment plans, and optimize resource allocation for public health strategies. This innovative approach not only benefits public health officials, healthcare providers, rehabilitation centers, researchers, and community organizations but also positively impacts individuals, families, society at large, and workplaces. Ultimately, the project seeks to reduce drug abuse rates, improve public health outcomes, and enhance the effectiveness of prevention and treatment programs.

### The Impact
The societal and business implications are significant:

#### 1. Improved Public Health:
**Impact:** Early identification and intervention can prevent drug abuse escalation, reducing healthcare burdens and drug-related deaths.\
**Fact:** Nearly 92,000 drug overdose deaths occurred in the U.S. in 2020.

#### 2. Enhanced Quality of Life:
**Impact:** Personalized treatment improves recovery rates, benefiting individuals and families, and fostering healthier communities.\
**Fact:** Effective interventions can significantly lower the $740 billion annual cost of drug-related healthcare, lost productivity, and criminal justice expenses in the U.S.

#### 3.Workplace Safety and Productivity:
**Impact:** Addressing drug abuse reduces absenteeism, improves presenteeism, and lowers health insurance costs, boosting overall productivity.\
**Fact:** Businesses can save on healthcare costs and enhance workforce efficiency by supporting drug abuse interventions.

#### 4. Corporate Social Responsibility and Innovation:
**Impact:** Companies can enhance their reputation and community ties by engaging in public health initiatives, and capitalize on market opportunities in health technologies.\
**Fact:** Collaborations with healthcare prov
iders for developing intervention strategies can position businesses as leaders in corporate social responsibility.

## Drug Consumption Classification
Database contains records for 1885 respondents with detailed meta information

### Context
Database contains records for 1885 respondents. For each respondent 12 attributes are known: Personality measurements which include NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), BIS-11 (impulsivity), and ImpSS (sensation seeking), level of education, age, gender, country of residence and ethnicity. All input attributes are originally categorical and are quantified. After quantification values of all input features can be considered as real-valued. In addition, participants were questioned concerning their use of 18 legal and illegal drugs (alcohol, amphetamines, amyl nitrite, benzodiazepine, cannabis, chocolate, cocaine, caffeine, crack, ecstasy, heroin, ketamine, legal highs, LSD, methadone, mushrooms, nicotine and volatile substance abuse and one fictitious drug (Semeron) which was introduced to identify over-claimers. For each drug they have to select one of the answers: never used the drug, used it over a decade ago, or in the last decade, year, month, week, or day. Database contains 18 classification problems. Each of independent label variables contains seven classes: "Never Used", "Used over a Decade Ago", "Used in Last Decade", "Used in Last Year", "Used in Last Month", "Used in Last Week", and "Used in Last Day".

In [1]:
# importing packages to be used later
import numpy as np
import pandas as pd
import sys

In [2]:
# Data Dictionary DataFrame
datad = { 
        'Column Name' : ['ID','Age','Gender','Education','Country','Ethnicity','Nscore','Escore','Oscore','Ascore',
        'Cscore','Impulsive','SS','Alcohol','Amphet','Amyl','Benzos','Caff','Cannabis',
        'Choc','Coke','Crack','Ecstasy','Heroin','Ketamine','Legalh','LSD','Meth','Mushrooms',
        'Nicotine','Semer','VSA'],
        'Description' : ['Unique ID for individual user',
        'Describes the age range of the individuals',
        'Indicates the gender of the individual',
        'Represents the highest level of education attained by the individual',
        'Indicates the country of residence of the individual',
        'Describes the ethnic background of the individual',
        'Nscore (Neuroticism Score): Quantifies the tendency towards emotional instability and anxiety',
        'Escore (Extraversion Score): Measures the tendency to be outgoing, sociable, and energetic',
        'Oscore (Openness Score): Indicates the level of creativity and openness to new experiences',
        'Ascore (Agreeableness Score): Represents the tendency to be compassionate and cooperative',
        'Cscore (Conscientiousness Score): Measures the level of self-discipline and goal-directed behavior',
        'Quantifies the level of impulsiveness of the individual',
        'SS (Sensation Seeking): Represents the tendency to seek novel and thrilling experiences',
        'Indicates the frequency of alcohol consumption',
        'Amphet (Amphetamines): Indicates the frequency of amphetamine use',
        'Amyl (Amyl Nitrite): Indicates the frequency of amyl nitrite use',
        'Benzos (Benzodiazepines): Indicates the frequency of benzodiazepine use',
        'Caff (Caffeine): Indicates the frequency of caffeine consumption',
        'Indicates the frequency of cannabis use',
        'Choc (Chocolate): Indicates the frequency of chocolate consumption',
        'Coke (Cocaine): Indicates the frequency of cocaine use',
        'Crack (Crack Cocaine): Indicates the frequency of crack cocaine use',
        'Indicates the frequency of ecstasy use',
        'Indicates the frequency of heroin use',
        'Indicates the frequency of ketamine use',
        'Legalh (Legal Highs): Indicates the frequency of legal highs use',
        'Indicates the frequency of LSD use',
        'Meth (Methadone): Indicates the frequency of methadone use',
        'Indicates the frequency of hallucinogenic mushrooms use',
        'Indicates the frequency of nicotine use',
        'Semer (Semeron): Indicates the frequency of semeron use',
        'VSA (Volatile Substance Abuse): Indicates the frequency of volatile substance abuse']    
        }
data_dic = pd.DataFrame(datad)

# Display the DataFrame as a markdown table
print(data_dic.to_markdown(index=False))



| Column Name   | Description                                                                                        |
|:--------------|:---------------------------------------------------------------------------------------------------|
| ID            | Unique ID for individual user                                                                      |
| Age           | Describes the age range of the individuals                                                         |
| Gender        | Indicates the gender of the individual                                                             |
| Education     | Represents the highest level of education attained by the individual                               |
| Country       | Indicates the country of residence of the individual                                               |
| Ethnicity     | Describes the ethnic background of the individual                                                  |
| Nscore        | Nscore (Neuroticism Score): Qu

#### Data Dictionary

Demographic & client data:
- age : age of the individual (float)
- gender : gender of the individual (float)
- education : education of the individual (float)
- country : country of residence (float)
- ethnicity : ethnicity of the individual (float)

Data about Personality measurements which include Big Five Personality Traits (OCEAN) (Openness to Experience (Oscore),Conscientiousness (Cscore),Extraversion (Escore),Agreeableness (Ascore),Neuroticism (Nscore)). This model is one of the most widely used frameworks in psychology. It identifies five broad dimensions of personality. As well BIS-11 (impulsivity), and ImpSS (sensation seeking) :
- Nscore :  (float)
- Escore :  (float)
- Oscore :  (float)
- Ascore :  (float)
- Cscore :  (float)
- Impulsive :  (float)
- SS :  (float)


Data about kind of the drugs consumed and the time of consumption devided into 7 classes: (CL0,CL1,CL2,CL3,CL4,CL5,CL6)

- Alcohol (categorical)
- Amphet (categorical)
- Amyl (categorical)
- Benzos (categorical)
- Caff (categorical)
- Cannabis (categorical)
- Choc (categorical)
- Coke (categorical)
- Crack (categorical)
- Ecstasy (categorical)
- Heroin (categorical)
- Ketamine (categorical)
- Legalh (categorical)
- LSD (categorical)
- Meth (categorical)
- Mushrooms (categorical)
- Nicotine (categorical)
- Semer (categorical)
- VSA (categorical)


The data downloaded came in the above listed data types and format. The task in this notebook is to map these features into human readable format from code format.

# INDEX

[Data Loading and Understanding](#Data)

- [Part 1: Data Exploration](#Reading)


## Data Loading and Understanding <a id='Data'></a>

Reading in the data and performing any required data cleaning steps

<a id='Reading'></a>

In [3]:
# Load the Drug Consumption data
ddf = pd.read_csv('drug_consumption.csv')

In [4]:
# Getting the columns

ddf.columns

Index(['ID', 'Age', 'Gender', 'Education', 'Country', 'Ethnicity', 'Nscore',
       'Escore', 'Oscore', 'Ascore', 'Cscore', 'Impulsive', 'SS', 'Alcohol',
       'Amphet', 'Amyl', 'Benzos', 'Caff', 'Cannabis', 'Choc', 'Coke', 'Crack',
       'Ecstasy', 'Heroin', 'Ketamine', 'Legalh', 'LSD', 'Meth', 'Mushrooms',
       'Nicotine', 'Semer', 'VSA'],
      dtype='object')

In [5]:
# Getting the header
ddf.head()

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,...,Ecstasy,Heroin,Ketamine,Legalh,LSD,Meth,Mushrooms,Nicotine,Semer,VSA
0,1,0.49788,0.48246,-0.05921,0.96082,0.126,0.31287,-0.57545,-0.58331,-0.91699,...,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
1,2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,...,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
2,3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.6209,...,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
3,4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,...,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
4,5,0.49788,0.48246,1.98437,0.96082,-0.31685,0.73545,-1.6334,-0.45174,-0.30172,...,CL1,CL0,CL0,CL1,CL0,CL0,CL2,CL2,CL0,CL0


In [6]:
# Getting the info
ddf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1885 entries, 0 to 1884
Data columns (total 32 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         1885 non-null   int64  
 1   Age        1885 non-null   float64
 2   Gender     1885 non-null   float64
 3   Education  1885 non-null   float64
 4   Country    1885 non-null   float64
 5   Ethnicity  1885 non-null   float64
 6   Nscore     1885 non-null   float64
 7   Escore     1885 non-null   float64
 8   Oscore     1885 non-null   float64
 9   Ascore     1885 non-null   float64
 10  Cscore     1885 non-null   float64
 11  Impulsive  1885 non-null   float64
 12  SS         1885 non-null   float64
 13  Alcohol    1885 non-null   object 
 14  Amphet     1885 non-null   object 
 15  Amyl       1885 non-null   object 
 16  Benzos     1885 non-null   object 
 17  Caff       1885 non-null   object 
 18  Cannabis   1885 non-null   object 
 19  Choc       1885 non-null   object 
 20  Coke    

In [7]:
# Getting the Age column values and count
print(ddf['Age'].value_counts())

Age
-0.95197    643
-0.07854    481
 0.49788    356
 1.09449    294
 1.82213     93
 2.59171     18
Name: count, dtype: int64


In [8]:
# Assigning the actual age value to the float values
ddf['Age'] = ddf['Age'].map({-0.95197:'18 - 24',
                            -0.07854:'25 - 34',
                            0.49788:'35 - 44',
                            1.09449:'45 - 54',
                            1.82213:'55 - 64',
                            2.59171:'65+'
                            }).astype(object)

In [9]:
# Checking and confirming updated mapped values for the Age column
print(ddf['Age'].value_counts())

Age
18 - 24    643
25 - 34    481
35 - 44    356
45 - 54    294
55 - 64     93
65+         18
Name: count, dtype: int64


In [10]:
# Getting the Gender column values and count
print(ddf['Gender'].value_counts())

Gender
-0.48246    943
 0.48246    942
Name: count, dtype: int64


In [11]:
# Assigning the actual Gender value to the float values
ddf['Gender'] = ddf['Gender'].map({0.48246:'Female',
                             -0.48246:'Male'  
                            }).astype(object)

In [12]:
# Getting the Education column values and count
print(ddf['Education'].value_counts())

Education
-0.61113    506
 0.45468    480
 1.16365    283
-0.05921    270
-1.22751    100
-1.73790     99
 1.98437     89
-1.43719     30
-2.43591     28
Name: count, dtype: int64


In [13]:
# Assigning the actual Education value to the float values
ddf['Education'] = ddf['Education'].map({-2.43591:'Left School Before 16 years',
                             -1.73790:'Left School at 16 years',
                             -1.43719:'Left School at 17 years',
                             -1.22751:'Left School at 18 years',
                             -0.61113:'Some College,No Certificate Or Degree',
                             -0.05921:'Professional Certificate/ Diploma',
                             0.45468:'University Degree',
                             1.16365:'Masters Degree',
                             1.98437:'Doctorate Degree'
                            }).astype(object)

In [14]:
# Getting the Country column values and count
print(ddf['Country'].value_counts())

Country
 0.96082    1044
-0.57009     557
-0.28519     118
 0.24923      87
-0.09765      54
 0.21128      20
-0.46841       5
Name: count, dtype: int64


In [15]:
# Assigning the actual Education value to the float values
ddf['Country'] = ddf['Country'].map({
                                    -0.09765:'Australia',
                                    0.24923:'Canada',
                                    -0.46841:'New Zealand',
                                    -0.28519:'Other',
                                    0.21128:'Republic of Ireland',
                                    0.96082:'UK',
                                    -0.57009:'USA'
                                    }).astype(object)



In [16]:
# Getting the Ethnicity column values and count
print(ddf['Ethnicity'].value_counts())

Ethnicity
-0.31685    1720
 0.11440      63
-1.10702      33
-0.50212      26
 0.12600      20
-0.22166      20
 1.90725       3
Name: count, dtype: int64


In [17]:
# Assigning the actual Education value to the float values
ddf['Ethnicity'] = ddf['Ethnicity'].map({
                                        -0.50212:'Asian',
                                        -1.10702:'Black',
                                        1.90725:'Mixed-Black/Asian',
                                        0.126:'Mixed-White/Asian',
                                        -0.22166:'Mixed-White/Black',
                                        0.1144:'Other',
                                        -0.31685:'White'
                                        }).astype(object)

In [18]:
# Getting the Nscore column values and count
print(ddf['Nscore'].value_counts())

Nscore
-0.46725    87
 0.41667    80
-0.34799    78
 0.62967    77
-0.14882    76
 0.04257    73
-0.79151    70
-0.05188    69
-0.24649    68
 0.13606    67
 1.02119    67
 0.31287    66
-0.92104    65
 0.22393    63
 0.52135    61
-0.58016    61
-0.67825    60
-1.05308    57
-1.19430    56
 0.82562    51
 1.23461    49
 0.73545    49
 1.37297    40
 0.91093    37
-1.32828    35
-1.69163    31
-1.43907    29
 1.13281    27
 1.60383    27
-1.55078    26
-1.86962    24
 1.49158    24
 1.83990    20
 1.72012    17
-2.05048    16
 1.98437    15
 2.12700    11
-2.21844    10
 2.28554    10
-2.75696     7
 2.46262     6
 2.82196     5
-2.34360     4
-2.52197     4
-2.42317     3
 2.61139     3
 3.27393     2
-3.15735     1
-3.46436     1
Name: count, dtype: int64


In [19]:
# Assigning the actual Nscore value to the float values
ddf['Nscore'] = ddf['Nscore'].map({
-3.46436:12,-1.32828:24,0.04257:36,1.23461:48,
-3.15735:13,-1.1943:25,0.13606:37,1.37297:49,
-2.75696:14,-1.05308:26,0.22393:38,1.49158:50,
-2.52197:15,-0.92104:27,0.31287:39,1.60383:51,
-2.42317:16,-0.79151:28,0.41667:40,1.72012:52,
-2.3436:17,-0.67825:29,0.52135:41,1.8399:53,
-2.21844:18,-0.58016:30,0.62967:42,1.98437:54,
-2.05048:19,-0.46725:31,0.73545:43,2.127:55,
-1.86962:20,-0.34799:32,0.82562:44,2.28554:56,
-1.69163:21,-0.24649:33,0.91093:45,2.46262:57,
-1.55078:22,-0.14882:34,1.02119:46,2.61139:58,
-1.43907:23,-0.05188:35,1.13281:47,2.82196:59,
3.27393:60}).astype(int)

In [20]:
# Getting the Escore column values and count
print(ddf['Escore'].value_counts())

Escore
 0.00332    130
 0.16767    116
 0.32197    109
-0.15487    107
-0.30033    106
 0.47617    105
 0.63779    103
 0.80523     91
-0.43999     90
-0.57545     89
-0.94779     77
 0.96248     69
-0.80615     68
 1.11406     64
 1.28610     62
-0.69509     58
-1.23177     55
-1.09207     52
-1.37639     38
 1.45421     37
 1.74091     34
-1.50796     32
 1.58487     25
-1.76250     23
-1.63340     23
 1.93886     21
-1.92173     21
 2.12700     15
 2.32338     10
 2.57309      9
-2.11437      9
-2.32338      8
-2.72827      6
-2.21069      5
-2.03972      4
-2.53830      3
-2.44904      3
-3.27393      2
 2.85950      2
 3.27393      2
 3.00537      1
-3.00537      1
Name: count, dtype: int64


In [21]:
# Assigning the actual Escore value to the float values
ddf['Escore'] = ddf['Escore'].map({
-3.27393:16,-1.7625:27,-0.30033:38,1.45421:49,
-3.00537:17,-1.6334:28,-0.15487:39,1.58487:50,
-3.00537:18,-1.50796:29,0.00332:40,1.74091:51,
-2.72827:19,-1.37639:30,0.16767:41,1.93886:52,
-2.5383:20,-1.23177:31,0.32197:42,2.127:53,
-2.44904:21,-1.09207:32,0.47617:43,2.32338:54,
-2.32338:22,-0.94779:33,0.63779:44,2.57309:55,
-2.21069:23,-0.80615:34,0.80523:45,2.8595:56,
-2.11437:24,-0.69509:35,0.96248:46,2.8595:57,
-2.03972:25,-0.57545:36,1.11406:47,3.00537:58,
-1.92173:26,-0.43999:37,1.2861:48,3.27393:59
}).astype(int)

In [22]:
# Getting the Oscore column values and count
print(ddf['Oscore'].value_counts())

Oscore
-0.01928    134
 0.29338    116
 0.14143    107
-0.17779    103
-0.31776    101
 0.44585     98
-0.58331     87
 0.72330     87
 0.88309     87
-0.45174     86
 0.58331     83
 1.06238     81
-0.71727     76
-0.84732     68
-1.11902     64
 1.43533     63
-0.97631     60
 1.24033     57
-1.27553     51
-1.42424     39
 1.65653     38
 1.88511     34
-1.55521     26
-1.68062     25
-1.82919     23
 2.15324     19
 2.44904     13
-1.97495     13
-2.39883     11
-2.09015      9
-2.21069      9
 2.90161      7
-2.85950      4
-2.63199      4
-3.27393      2
Name: count, dtype: int64


In [23]:
# Assigning the actual Oscore value to the float values.
#This attribute has 'nan' value, that's why I converted into 'str' for now.

ddf['Oscore'] = ddf['Oscore'].map({
-3.27393:24,-1.11902:38,0.58331:50,
-2.8595:26,-0.97631:39,0.7233:51,
-2.63199:28,-0.84732:40,0.88309:52,
-2.39883:29,-0.71727:41,1.06238:53,
-2.21069:30,-0.58331:42,1.24033:54,
-2.09015:31,-0.45174:43,1.43533:55,
-1.97495:32,-0.31776:44,1.65653:56,
-1.82919:33,-0.17779:45,1.88511:57,
-1.68062:34,-0.01928:46,1.15324:58,
-1.55521:35,0.14143:47,2.44904:59,
-1.42424:36,0.29338:48,2.90161:60,
-1.27553:37,0.44585:49
}).astype(str)

In [24]:
print(ddf['Oscore'].value_counts())

Oscore
46.0    134
48.0    116
47.0    107
45.0    103
44.0    101
49.0     98
42.0     87
51.0     87
52.0     87
43.0     86
50.0     83
53.0     81
41.0     76
40.0     68
38.0     64
55.0     63
39.0     60
54.0     57
37.0     51
36.0     39
56.0     38
57.0     34
35.0     26
34.0     25
33.0     23
nan      19
59.0     13
32.0     13
29.0     11
31.0      9
30.0      9
60.0      7
26.0      4
28.0      4
24.0      2
Name: count, dtype: int64


In [25]:
# Getting the Ascore column values and count
print(ddf['Ascore'].value_counts())

Ascore
 0.13136    118
-0.30172    114
 0.28783    112
-0.01729    105
 0.76096    104
-0.60633    102
-0.15487    101
 0.59042    100
 0.43852    100
-0.45321     98
 0.94156     85
-0.91699     83
-0.76096     82
 1.11406     68
-1.07533     62
 1.28610     58
-1.21213     45
-1.34289     42
 1.45039     39
 1.81866     36
 1.61108     36
-1.47955     34
-1.62090     30
-1.77200     24
-1.92595     18
 2.03972     16
 2.23427     14
-2.07848     13
 2.46262      8
-2.21844      8
-2.53830      7
 2.75696      7
-2.35413      7
-2.78793      2
-3.15735      1
-3.00537      1
-3.46436      1
-2.90161      1
 3.15735      1
-2.70172      1
 3.46436      1
Name: count, dtype: int64


In [26]:
# Assigning the actual Ascore value to the float values
ddf['Ascore'] = ddf['Ascore'].map({
-3.46436:12,-1.34289:34,0.76096:48,
-3.15735:16,-1.21213:35,0.94156:49,
-3.00537:18,-1.07533:36,1.11406:50,
-2.90161:23,-0.91699:37,1.2861:51,
-2.78793:24,-0.76096:38,1.45039:52,
-2.70172:25,-0.60633:39,1.61108:53,
-2.5383:26,-0.45321:40,1.81866:54,
-2.35413:27,-0.30172:41,2.03972:55,
-2.21844:28,-0.15487:42,2.23427:56,
-2.07848:29,-0.01729:43,2.46262:57,
-1.92595:30,0.13136:44,2.75696:58,
-1.772:31,0.28783:45,3.15735:59,
-1.6209:32,0.43852:46,3.46436:60,
-1.47955:33,0.59042:47
}).astype(int)

In [27]:
# Getting the Cscore column values and count
print(ddf['Cscore'].value_counts())

Cscore
 0.58489    113
 0.25953    111
 0.41594    111
-0.00665    105
-0.14277     99
-0.27607     97
 0.75830     95
 0.93949     95
 0.12331     90
-0.40581     87
-0.65253     81
-0.52745     77
 1.13407     76
-0.78155     69
-0.89891     55
-1.01450     55
-1.13788     49
 1.30612     47
 1.46191     43
-1.38502     41
-1.25773     39
 1.63088     34
-1.51840     29
 1.81175     28
 2.04506     27
-1.78169     25
-1.64101     24
 2.33337     13
-1.92173     13
-2.04506     13
-2.18109      9
 2.63199      8
-2.30408      6
-2.57309      5
-2.42317      5
-2.90161      3
 3.00537      3
-2.72827      2
 3.46436      1
-3.15735      1
-3.46436      1
Name: count, dtype: int64


In [28]:
# Assigning the actual Cscore value to the float values
ddf['Cscore'] = ddf['Cscore'].map({
-3.46436:17,-1.25773:32,0.58489:46,
-3.15735:19,-1.13788:33,0.7583:47,
-2.90161:20,-1.0145:34,0.93949:48,
-2.72827:21,-0.89891:35,1.13407:49,
-2.57309:22,-0.78155:36,1.30612:50,
-2.42317:23,-0.65253:37,1.46191:51,
-2.30408:24,-0.52745:38,1.63088:52,
-2.18109:25,-0.40581:39,1.81175:53,
-2.04506:26,-0.27607:40,2.04506:54,
-1.92173:27,-0.14277:41,2.33337:55,
-1.78169:28,-0.00665:42,2.63199:56,
-1.64101:29,0.12331:43,3.00537:57,
-1.5184:30,0.25953:44,3.46436:59,
-1.38502:31,0.41594:45
}).astype(int)

In [29]:
# Getting the Impulsive column values and count
print(ddf['Impulsive'].value_counts())

Impulsive
-0.21712    355
-0.71126    307
-1.37983    276
 0.19268    257
 0.52975    216
 0.88113    195
 1.29221    148
 1.86203    104
-2.55524     20
 2.90161      7
Name: count, dtype: int64


In [30]:
# Define the ranges and corresponding categories for the "Impulsive" trait
impulsive_ranges = {
    "Very Low": (-float('inf'), -2),
    "Low": (-2, -1.5),
    "Moderate": (-1.5, -1),
    "Average": (-1, 1),
    "High": (1, 1.5),
    "Very High": (1.5, float('inf'))
}

# Function to map values based on the ranges for the "Impulsive" trait
def map_impulsive_value(value):
    for category, (lower, upper) in impulsive_ranges.items():
        if lower <= value < upper:
            return category
    return "Undefined"

# Assuming your data is in a DataFrame column named 'Impulsive_Value'
ddf['Impulsive_class'] = ddf['Impulsive'].apply(map_impulsive_value)


In [31]:
# Getting the SS column values and count
print(ddf['SS'].value_counts())

SS
 0.40148    249
-0.21575    223
 0.07987    219
-0.52593    211
 0.76540    211
 1.22470    210
-0.84637    169
-1.18084    132
 1.92173    103
-1.54858     87
-2.07848     71
Name: count, dtype: int64


In [32]:
# Define the ranges and corresponding categories
ranges = {
    "Very Low": (-float('inf'), -1.5),
    "Low": (-1.5, -1),
    "Average": (-1, 1),
    "High": (1, 1.5),
    "Very High": (1.5, float('inf'))
}

# Function to map values based on the ranges
def map_ss_value(value):
    for category, (lower, upper) in ranges.items():
        if lower <= value < upper:
            return category
    return "Undefined"

# Assuming your data is in a DataFrame column named 'SS_Value'
ddf['SS_class'] = ddf['SS'].apply(map_ss_value)


# The actual value of the codes are 
'CL0':'Never'
'CL1':'Decade Ago'
'CL2':'Last Decade'
'CL3':'Last Year'
'CL4':'Last Month'
'CL5':'Last Week'
'CL6':'Last Day'

In [33]:
# Mapping all the drug types columns values to more readable values
List_of_Columns = ddf[ddf.columns[13:32]]

#for col in List_of_Columns:
#    ddf[col] = ddf[col].map({
#           'CL0':7,
#           'CL1':6,
#           'CL2':5,
#           'CL3':4,
#           'CL4':3,
#           'CL5':2,
#           'CL6':1
#            })

for col in List_of_Columns:
    ddf[col] = ddf[col].map({
        'CL0':'Never',
        'CL1':'Decade Ago', 
        'CL2':'Last Decade', 
        'CL3':'Last Year', 
        'CL4':'Last Month', 
        'CL5':'Last Week',
        'CL6':'Last Day'
            })
    
    

In [34]:
ddf.head()

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,...,Ketamine,Legalh,LSD,Meth,Mushrooms,Nicotine,Semer,VSA,Impulsive_class,SS_class
0,1,35 - 44,Female,Professional Certificate/ Diploma,UK,Mixed-White/Asian,39,36,42.0,37,...,Never,Never,Never,Never,Never,Last Decade,Never,Never,Average,Low
1,2,25 - 34,Male,Doctorate Degree,UK,White,29,52,55.0,48,...,Last Decade,Never,Last Decade,Last Year,Never,Last Month,Never,Never,Average,Average
2,3,35 - 44,Male,Professional Certificate/ Diploma,UK,White,31,45,40.0,32,...,Never,Never,Never,Never,Decade Ago,Never,Never,Never,Moderate,Average
3,4,18 - 24,Female,Masters Degree,UK,White,34,34,46.0,47,...,Last Decade,Never,Never,Never,Never,Last Decade,Never,Never,Moderate,Low
4,5,35 - 44,Female,Doctorate Degree,UK,White,43,28,43.0,41,...,Never,Decade Ago,Never,Never,Last Decade,Last Decade,Never,Never,Average,Average


In [35]:
# Saving all the mapping in a new csv
ddf1 = ddf.to_csv('drug_consumption_mapped.csv')

In [36]:
ddf.head()

Unnamed: 0,ID,Age,Gender,Education,Country,Ethnicity,Nscore,Escore,Oscore,Ascore,...,Ketamine,Legalh,LSD,Meth,Mushrooms,Nicotine,Semer,VSA,Impulsive_class,SS_class
0,1,35 - 44,Female,Professional Certificate/ Diploma,UK,Mixed-White/Asian,39,36,42.0,37,...,Never,Never,Never,Never,Never,Last Decade,Never,Never,Average,Low
1,2,25 - 34,Male,Doctorate Degree,UK,White,29,52,55.0,48,...,Last Decade,Never,Last Decade,Last Year,Never,Last Month,Never,Never,Average,Average
2,3,35 - 44,Male,Professional Certificate/ Diploma,UK,White,31,45,40.0,32,...,Never,Never,Never,Never,Decade Ago,Never,Never,Never,Moderate,Average
3,4,18 - 24,Female,Masters Degree,UK,White,34,34,46.0,47,...,Last Decade,Never,Never,Never,Never,Last Decade,Never,Never,Moderate,Low
4,5,35 - 44,Female,Doctorate Degree,UK,White,43,28,43.0,41,...,Never,Decade Ago,Never,Never,Last Decade,Last Decade,Never,Never,Average,Average
