# Relationship between marijuana/cannabis use and household income 

Many studies have found a relationship between substance abuse and socioeconomic background, showing that substance abuse in young adults/children is associated with lower socioeconomic status. The data set we chose to help understand these associations is a survey conducted in 2018/2019 on student tobacco, alcohol and drug use. We want to find out whether there is a relationship between marijuana/cannabis use and household income.

Methods:
We will use the following two variables from the data set and apply classification rather than regression:

   **CAN_040** : In the last 30 days, how often did you use marijuana or cannabis?

   **DVHHINC2** : Median Household Income of the area where the respondent’s school is located according to the Canadian 2016 census data

**Describe one way we will visualize the results**
Bar plot: To display the relationship between income and the number of people using marijuana/cannabis, we will plot household income on the x-axis and the number of marijuana/cannabis users on the y-axis.
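
A minimal sketch of such a bar plot is shown here. It uses made-up toy data rather than the survey itself, and the logical column `used_cannabis` is a hypothetical stand-in: how the `CAN_040` response codes map to "used" versus "did not use" is an assumption that would have to be checked against the survey codebook.

```r
library(tidyverse)

# Made-up toy data, NOT taken from the survey.
# `used_cannabis` is a hypothetical flag; deriving it from CAN_040
# requires the response-code meanings from the survey codebook.
toy <- tibble(
  household_income = c(40000, 40000, 60000, 60000, 60000, 80000),
  used_cannabis    = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Count the students who reported use at each income level.
users_by_income <- toy %>%
  filter(used_cannabis) %>%
  count(household_income, name = "n_users")

# Income on the x-axis (as discrete groups), number of users on the y-axis.
income_bar_plot <- ggplot(users_by_income,
                          aes(x = factor(household_income), y = n_users)) +
  geom_col() +
  labs(x = "Median household income of school area",
       y = "Number of students reporting use")

income_bar_plot
```

`geom_col()` is used because the heights are already computed counts; treating income as a factor gives one bar per income level.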
	

**Expected Outcomes and Significance**
We expect marijuana/cannabis use to be lower where household income is higher, that is, a negative relationship between these two variables.
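
One possible way to check the direction of this expected relationship is to compare the share of students reporting use at each income level. The sketch below again relies on made-up toy data and a hypothetical `used_cannabis` flag (an assumption, not a column in the survey), so it only illustrates the calculation.

```r
library(tidyverse)

# Made-up toy data, NOT taken from the survey.
# `used_cannabis` is a hypothetical flag, not a survey variable.
toy <- tibble(
  household_income = c(40000, 40000, 40000, 40000,
                       80000, 80000, 80000, 80000),
  used_cannabis    = c(TRUE, TRUE, FALSE, FALSE,
                       TRUE, FALSE, FALSE, FALSE)
)

# Share of students reporting use within each income level.
use_by_income <- toy %>%
  group_by(household_income) %>%
  summarise(
    prop_users = mean(used_cannabis),  # mean of TRUE/FALSE is a proportion
    n_students = n()
  )

use_by_income
```

A negative relationship would show up as `prop_users` falling as `household_income` rises. Proportions are used here instead of raw counts so that income levels with different numbers of respondents remain comparable.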

**What impact will these findings have?**
If our findings show that marijuana/cannabis use is associated with a particular income group, organizations could focus their efforts on areas in that income range in an attempt to lower substance abuse.

If our hypothesis is correct, we could address further questions such as:

- Would individuals who come from a lower-income household have better access to other types of drugs?
- Why would higher-income individuals have lower drug use?
- What other reasons could be behind higher drug use among individuals who come from lower-income households? Is bullying a factor?


In [1]:
library(repr)
library(tidyverse)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

In [3]:
drugdata <- read_tsv("cstdata.tab")

Rows: 62850 Columns: 185
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (185): SCANID, MODULE, PROVID, SCHID, GRADE, SEX, SS_010, SS_020, TS_011...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [5]:
head(drugdata)

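| SCANID | MODULE | PROVID | SCHID | GRADE | SEX | SS_010 | SS_020 | TS_011 | TV_010 | ⋯ | DVTY2ST | DVLAST30 | DVAMTSMK | DVCIGWK | DVNDSMK | DVAVCIGD | DVRES | DVURBAN | DVHHINC2 | WTPP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 100224 | 1 | 35 | 3589267 | 9 | 1 | 1 | 13 | 2 | 3 | ⋯ | 1 | 1 | 1 | 8 | 5 | 2 | 1 | 2 | 60000 | 33.15 |
| 100225 | 1 | 35 | 3589267 | 9 | 1 | 1 | 13 | 1 | 1 | ⋯ | 1 | 1 | 10 | 70 | 7 | 10 | 1 | 2 | 60000 | 33.15 |
| 100226 | 1 | 35 | 3589267 | 12 | 1 | 2 | 96 | 4 | 3 | ⋯ | 7 | 2 | 96 | 996 | 96 | 96 | 3 | 2 | 60000 | 75.14 |
| 100227 | 1 | 35 | 3589267 | 12 | 2 | 1 | 2 | 3 | 2 | ⋯ | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 60000 | 105.7 |
| 100228 | 1 | 35 | 3589267 | 12 | 1 | 1 | 2 | 3 | 3 | ⋯ | 6 | 2 | 96 | 996 | 96 | 96 | 1 | 2 | 60000 | 75.14 |
| 100229 | 1 | 35 | 3589267 | 11 | 1 | 2 | 96 | 4 | 3 | ⋯ | 7 | 2 | 96 | 996 | 96 | 96 | 1 | 2 | 60000 | 63.87 |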


First, we select only the data that we need: the variable CAN_040, which tells us how often an individual used marijuana/cannabis in the last 30 days, and DVHHINC2, the median household income of the area where the individual's school is located.

We then rename the columns so they are more readable.

Only the first 6 rows of the selected data are shown.

In [13]:
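# Keep only the cannabis-use and household-income columns.
selected_drugdata <- select(drugdata, CAN_040, DVHHINC2)
# Give the columns readable names.
colnames(selected_drugdata) <- c("cannabis_use", "household_income")

# Preview the first six rows.
head(selected_drugdata)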

| cannabis_use | household_income |
| --- | --- |
| `<dbl>` | `<dbl>` |
| 96 | 60000 |
| 6 | 60000 |
| 2 | 60000 |
| 2 | 60000 |
| 2 | 60000 |
| 96 | 60000 |
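
Because `cannabis_use` is stored as numeric response codes (the preview shows values such as 2, 6 and 96), a reasonable next step might be to tally the codes and store them as a factor before any classification. The sketch below uses a toy tibble that copies the six previewed rows; what each code means is not stated here and would need to be looked up in the survey codebook.

```r
library(tidyverse)

# Toy stand-in containing only the six rows shown in the preview.
toy_selected <- tibble(
  cannabis_use     = c(96, 6, 2, 2, 2, 96),
  household_income = c(60000, 60000, 60000, 60000, 60000, 60000)
)

# Tally how often each response code occurs.
code_counts <- toy_selected %>%
  count(cannabis_use)

# Classification needs a categorical outcome, so store the codes as a factor.
toy_selected <- toy_selected %>%
  mutate(cannabis_use = factor(cannabis_use))

code_counts
levels(toy_selected$cannabis_use)
```

On the full data, the same `count()` call applied to `selected_drugdata` would show every code that occurs, which helps spot special codes before modelling.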
