# Segmentation Capstone Project
## Question: 
- **Using a consumer segmentation based on attitudes toward education and motivation to learn, can I predict income level?**
- **Using a consumer segmentation based on education and motivation to continue learning, can I predict whether a respondent is employed or a student?**
    - Why this question is relevant: I hypothesize that the population is segmented by how engaged people are with their education and with technology. While this data is not tied to a specific product, you can imagine an online education provider wanting to target a specific consumer group, or a company wanting to assess how willing its employees would be to expand their knowledge through an online training course. Segmenting the market makes it possible to build different marketing and targeting strategies and to identify who is most likely to engage with the product.
    - Having this segmentation then allows us to identify these groups of people in order to serve a targeted ad. If I can build a model that accurately predicts which segment a consumer falls into, I can then show ads for my different products to the segments I have identified as the most valuable.

## Steps: 
1. Segment the data into consumers using attitudinal and behavioral questions (using the segmentation variables)
    - Will probably go through some iterations of how best to segment
2. Profile the segments (using the profiling variables)
3. Given a new respondent, build a model that predicts which segment the new consumer falls into
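
The three steps above can be sketched end to end on synthetic data. This is only a minimal illustration, not the final method: the choice of k-means and logistic regression, the value `k = 3`, and the profiling column names `educ` and `age_group` are all assumptions made for the example, and treating 1-4 survey answers as plain numbers is a simplification.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey: 300 respondents with 1-4 answers.
rng = np.random.default_rng(0)
n = 300
seg_cols = ["q2a", "q3a", "q3f", "q19a"]   # segmentation variables
prof_cols = ["educ", "inc", "age_group"]   # hypothetical profiling columns
df = pd.DataFrame(rng.integers(1, 5, size=(n, len(seg_cols))), columns=seg_cols)
for col in prof_cols:
    df[col] = rng.integers(1, 7, size=n)

# Step 1: cluster respondents on the segmentation variables only.
k = 3  # assumed; in practice several values of k would be compared
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
df["segment"] = kmeans.fit_predict(df[seg_cols])

# Step 2: profile each segment with variables that were NOT used to cluster.
profile = df.groupby("segment")[prof_cols].mean()

# Step 3: predict the segment of a new respondent from the profiling variables.
X_train, X_test, y_train, y_test = train_test_split(
    df[prof_cols], df["segment"], test_size=0.25, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print(profile.round(2))
print(f"hold-out accuracy: {accuracy:.2f}")
```

Because the synthetic answers are random, the accuracy here says nothing about the real survey; the point is only the shape of the workflow.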
    

## Data: 
- Data comes from Pew Research Center: http://www.pewinternet.org/datasets/2018/
- Specifically using the Oct. 13-Nov. 15, 2015 – Educational Ecosystem: http://www.pewinternet.org/dataset/october-2015-educational-ecosystem/
    - 'This dataset contains questions about technology use, lifelong learning, and career and personal development.'
- 2752 respondents
    - 2327 Internet users (I may filter to internet users only, depending on the question I am trying to answer; e.g. an online education platform wouldn't target non-internet users)
    
## Variables
### Segmentation Variables
- Q2	Now I’d like to know how important, if at all, you think it is for people to make an effort to learn new things in some different areas of life. [FOR FIRST TWO RANDOMIZED ITEMS: (First,/Next,) do you think it is very important, somewhat important, not too important, or not at all important for people to make an effort to learn NEW things related to [INSERT ITEMS; RANDOMIZE]?] {new}
[FOR REMAINING ITEMS: How about learning NEW things related to [INSERT NEXT ITEM]? [READ AS NECESSARY: Do you think it is very important, somewhat important, not too important, or not at all important for people to make an effort to learn NEW things related to (ITEM)?]]
        a.	Their jobs
        b.	Their hobbies or interests
        c.	Things happening in society, such as developments in science, technology, entertainment, or culture
        NO ITEM D
        e.	Their local community
        CATEGORIES
        1	Very important
        2	Somewhat important
        3	Not too important
        4	Not at all important
        8	(VOL.) Don't know
        9	(VOL.) Refused


- Q3	How well do each of the following statements describe you? How about this statement: [INSERT ITEMS; RANDOMIZE]. [READ FOR FIRST ITEM, THEN AS NECESSARY: Does this describe you very well, somewhat well, not too well, or not well at all?] Next: [INSERT NEXT ITEM]. {new; based on Education Dept 2005 survey}
        a.	I often find myself looking for new opportunities to grow as a person {Education Dept 2005 survey}
        b.	I am not the type of person who feels the need to probe deeply into new situations or things {Education Dept 2005 survey}
        c.	I like to gather as much information as I can when I come across something that I am not familiar with {Education Dept 2005 survey}
        d.	I am easily distracted when I try to concentrate {new} 
        e.	I am really glad I am no longer in school and don’t have to go to classes anymore {new} 
        f.	I think of myself as a lifelong learner {new}
        CATEGORIES
        1	Very well
        2	Somewhat well
        3	Not too well
        4	Not well at all
        5	(VOL.) Still in school [PROGRAM FOR ITEM e ONLY]
        8	(VOL.) Don't know
        9	(VOL.) Refused

- Q19	Please tell me how well each of the following statements describes you. First: [INSERT ITEMS; RANDOMIZE]. [READ FOR FIRST ITEM, THEN AS NECESSARY: Does this describe you very well, somewhat well, not too well, or not well at all?] Next: [INSERT NEXT ITEM]. {new}
        a.	When I get a new electronic device, I usually need someone else to set it up or show me how to use it
        b.	I am more productive because of all of my electronic information devices
        c.	I find it difficult to know whether the information I find online is trustworthy
        d.	Between phone calls, texts, emails, social media, or other messages, I deal with too much information in my daily life
        CATEGORIES
        1	Very well
        2	Somewhat well
        3	Not too well
        4	Not well at all 
        8	(VOL.) Don't know
        9	(VOL.) Refused
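
The category lists above all reserve 8 and 9 for "Don't know" and "Refused", and Q3e adds 5 for "Still in school". Before segmenting, these codes probably should not be treated as points on the 1-4 scale. A minimal sketch on toy rows, assuming the non-answers are simply set to missing and "still in school" is kept as a separate flag (both are my assumptions, not something the codebook prescribes):

```python
import pandas as pd

# Toy rows using the codebook's codes; column names follow the item labels.
toy = pd.DataFrame({
    "q2a":  [1, 2, 8, 4],
    "q3e":  [5, 1, 9, 3],   # 5 = (VOL.) Still in school, item e only
    "q19a": [4, 9, 2, 1],
})

# 8 = Don't know, 9 = Refused: mask both as missing (NaN).
cleaned = toy.mask(toy.isin([8, 9]))

# Keep "still in school" as its own flag, then remove it from the 1-4 scale.
cleaned["q3e_in_school"] = toy["q3e"].eq(5)
cleaned["q3e"] = cleaned["q3e"].where(cleaned["q3e"] != 5)

print(cleaned)
```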


### Model Variables to predict segment
(NOT USING THE VARIABLES INCLUDED IN THE SEGMENTATION)
- Gender, education, income, employment status, how they feel about their career, political party, etc.
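
Several of these predictors are nominal codes (party 1/2/3 has no order), so most models would need them expanded into indicator columns. A small sketch with `pd.get_dummies` on toy values; keeping `inc` as a numeric bracket is an assumption made for the example:

```python
import pandas as pd

# Toy predictors; the party codes follow the survey's codebook.
predictors = pd.DataFrame({
    "party": [1, 2, 3, 2],   # nominal: 1 Republican, 2 Democrat, 3 Independent
    "inc":   [3, 6, 9, 1],   # ordinal income bracket, kept numeric here
})

# One indicator column per party code; the ordinal column passes through.
encoded = pd.get_dummies(predictors, columns=["party"], prefix="party")
print(encoded.columns.tolist())
```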


## Next Steps (Out of Scope)
- Ideally this segmentation would be linked to a product or app, where it would allow the app/marketers to build a plan to target the most attractive segments
- Would love to incorporate spend data

# Initial Exploratory Data Analysis

In [9]:
import pandas as pd

In [10]:
#Read in Data
data = pd.read_csv('SEGMENTATION/October 13 - November 15, 2015 - Educational Ecosystem/October 13-November 15, 2015 - Educational Ecosystem - CSV.csv')

In [11]:
data.head()

Unnamed: 0,psraid,sample,int_date,lang,version,scregion,form,q1,eminuse,intmob,...,edinstos,q10jos,q15fos,raceos,wrkplos,usr,cregion,state,weight,standwt
0,100007,1,101315,1,1,3,2,1,1,1,...,,,,,,S,3,12,1.34375,0.44258
1,100008,1,101315,1,1,3,2,1,1,1,...,,,FARM,,,U,3,37,1.65625,0.545505
2,100010,1,101315,1,1,1,2,2,1,1,...,,,,,,R,1,42,3.53125,1.163058
3,100011,1,101615,1,4,1,2,2,1,1,...,,,,,,R,1,23,3.28125,1.080718
4,100014,1,101315,1,1,3,1,2,1,1,...,,,,,,R,3,51,5.78125,1.904121
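
The preview shows `weight` and `standwt` columns. Assuming these are survey weights (I have not confirmed this against the codebook), raw counts would differ from weighted shares. A sketch of a weighted share on toy values:

```python
import pandas as pd

# Toy answers with made-up weights.
toy = pd.DataFrame({
    "eminuse": [1, 1, 2, 1],
    "weight":  [1.5, 0.5, 2.0, 1.0],
})

# Sum of weights per answer, divided by the total weight.
weighted_share = toy.groupby("eminuse")["weight"].sum() / toy["weight"].sum()
print(weighted_share)
```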


In [12]:
#Show first 50 variables
data.iloc[:, :50].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2752 entries, 0 to 2751
Data columns (total 50 columns):
psraid      2752 non-null int64
sample      2752 non-null int64
int_date    2752 non-null int64
lang        2752 non-null int64
version     2752 non-null int64
scregion    2752 non-null int64
form        2752 non-null int64
q1          2752 non-null int64
eminuse     2752 non-null int64
intmob      2752 non-null int64
home3nw     2752 non-null object
bbhome1     2752 non-null object
bbhome2     2752 non-null object
device1a    2752 non-null object
smart1      2752 non-null object
snsint2     2752 non-null object
q2a         2752 non-null int64
q2b         2752 non-null int64
q2c         2752 non-null int64
q2e         2752 non-null int64
q3a         2752 non-null int64
q3b         2752 non-null int64
q3c         2752 non-null int64
q3d         2752 non-null int64
q3e         2752 non-null int64
q3f         2752 non-null int64
emplnw3     2752 non-null int64
stud        2752 non-nu

In [13]:
#Show the rest of the variables
data.iloc[:, 50:].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2752 entries, 0 to 2751
Data columns (total 77 columns):
q12d          2752 non-null object
q13a          2752 non-null int64
q13b          2752 non-null int64
q13c          2752 non-null int64
q13e          2752 non-null int64
q13f          2752 non-null int64
q15a          2752 non-null object
q15b          2752 non-null object
q15c          2752 non-null object
q15d          2752 non-null object
q15e          2752 non-null object
q15f          2752 non-null object
q16           2752 non-null object
q17a          2752 non-null object
q17b          2752 non-null object
q17c          2752 non-null object
q17d          2752 non-null object
q17f          2752 non-null object
q18a          2752 non-null object
q18b          2752 non-null object
q18c          2752 non-null object
q18d          2752 non-null object
q18e          2752 non-null object
q19a          2752 non-null int64
q19b          2752 non-null int64
q19c          2752 non-nu

##### EMINUSE	Do you use the internet or email, at least occasionally? {PIAL Trend}
        1	Yes
        2	No
        8	(VOL.) Don't know
        9	(VOL.) Refused


In [14]:
## Internet users (1 is yes, 2 is no, 8 is don't know, 9 is refused)
data['eminuse'].value_counts()

1    2327
2     421
8       3
9       1
Name: eminuse, dtype: int64
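
If the analysis is restricted to internet users, the filter is a single boolean mask on `eminuse`. A sketch on a toy frame (on the real data the same mask would keep the 2327 rows coded 1):

```python
import pandas as pd

# Toy frame with the EMINUSE codes: 1 Yes, 2 No, 8 Don't know, 9 Refused.
toy = pd.DataFrame({
    "psraid":  [1, 2, 3, 4, 5],
    "eminuse": [1, 2, 1, 8, 9],
})

# Keep only respondents who said they use the internet or email.
internet_users = toy[toy["eminuse"] == 1].copy()
print(len(internet_users), "of", len(toy), "rows kept")
```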

##### INC	Last year -- that is in 2014 -- what was your total family income from all sources, before taxes? Just stop me when I get to the right category... [READ] {Master INC2}
        1	Less than 10,000 
        2	10 to under 20,000 
        3	20 to under 30,000
        4	30 to under 40,000
        5	40 to under 50,000 
        6	50 to under 75,000
        7	75 to under 100,000
        8	100 to under 150,000, OR
        9	150,000 or more? 
        98	(VOL.) Don't know
        99	(VOL.) Refused 


In [15]:
data['inc'].value_counts()

6     368
7     299
8     292
3     278
2     261
4     258
9     254
99    225
1     218
5     197
98    102
Name: inc, dtype: int64
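
Two things make this tally hard to use as is: it is ordered by frequency rather than by bracket, and codes 98/99 would look like the highest incomes if left numeric. A sketch of one way to handle both on toy values; the three-level grouping (codes 1-3, 4-6, 7-9) is a hypothetical target chosen only for illustration:

```python
import pandas as pd

inc = pd.Series([6, 7, 99, 1, 98, 6, 3], name="inc")

# 98 = Don't know, 99 = Refused: set to missing so they are not read as
# the highest income brackets.
inc_clean = inc.mask(inc.isin([98, 99]))

# sort_index() lists the brackets in codebook order instead of by frequency.
counts = inc_clean.value_counts().sort_index()

# Hypothetical three-level income target: codes 1-3, 4-6, 7-9.
inc_level = pd.cut(inc_clean, bins=[0, 3, 6, 9], labels=["low", "middle", "high"])

print(counts)
print(inc_level.value_counts())
```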

##### Q1	Thinking now about job opportunities where you live, would you say there are plenty of jobs available in your community or are jobs difficult to find? {5-15} {QID:x010617-20} {new to PIAL surveys}
    1	Plenty of jobs available
    2	Jobs are difficult to find
    3	(VOL.) Lots of some jobs, few of others
    8	(VOL.) Don't know
    9	(VOL.) Refused


In [16]:
data['q1'].value_counts()

2    1251
1    1045
8     254
3     161
9      41
Name: q1, dtype: int64

##### EMPLNW3	Are you now employed full-time, part-time, or are you not employed for pay? {new}
        1	Employed full-time
        2	Employed part-time
        3	Not employed for pay
        8	(VOL.) Don't know
        9	(VOL.) Refused


In [17]:
data['emplnw3'].value_counts()

1    1246
3    1163
2     331
9       7
8       5
Name: emplnw3, dtype: int64

In [18]:
data['q13a'].value_counts()

2    2065
1     683
8       3
9       1
Name: q13a, dtype: int64

##### Q20	Overall, how confident do you feel using computers, smartphones, or other electronic devices to do the things you need to do online? Do you feel very confident, somewhat confident, only a little confident, or not at all confident? {new}
        1	Very confident
        2	Somewhat confident
        3	Only a little confident
        4	Not at all confident
        8	(VOL.) Don't know
        9	(VOL.) Refused


In [19]:
data['q20'].value_counts()

1    1290
2     815
      294
3     244
4     102
9       4
8       3
Name: q20, dtype: int64
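
The unlabeled row with 294 responses suggests that some cells in `q20` are blank strings, which would also explain why the column is read as `object`. Why they are blank is not clear from the output (possibly respondents who were not asked the question; that is a guess). A sketch of converting such a column to numeric, using toy values:

```python
import pandas as pd

# Toy version of a column read as strings because some cells are blank.
q20 = pd.Series(["1", "2", " ", "3", "9", " "], name="q20")

# errors="coerce" turns anything non-numeric (here the blanks) into NaN.
q20_num = pd.to_numeric(q20, errors="coerce")

# dropna=False keeps the missing group visible in the tally.
print(q20_num.value_counts(dropna=False).sort_index())
```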

##### Q21	Please tell me how familiar, if at all, you are with the following educational resources or concepts. (First, how familiar are you with / Next,) [INSERT ITEMS; RANDOMIZE]? [READ FOR FIRST ITEM, THEN AS NECESSARY: Are you very familiar, somewhat familiar, not too familiar, or not at all familiar?] {new}
        a.	Distance learning
        b.	Digital badges
        c.	Khan Academy
        d.	Common core standards
        e.	Massively open online courses, or MOOCs [PRONOUNCED: MOOKs] – such as Coursera, edX, or Udacity
        CATEGORIES
        1	Very familiar
        2	Somewhat familiar
        3	Not too familiar
        4	Not at all familiar
        8	(VOL.) Don't know
        9	(VOL.) Refused


In [21]:
data['q21a'].value_counts()

4    1270
2     705
1     422
3     328
8      23
9       4
Name: q21a, dtype: int64

In [22]:
data['q21c'].value_counts()

4    1914
2     299
3     257
1     251
8      30
9       1
Name: q21c, dtype: int64

In [23]:
data['q21e'].value_counts()

4    1841
2     367
3     364
1     152
8      22
9       6
Name: q21e, dtype: int64

##### REG	Which of these statements best describes you? [READ IN ORDER] [INSTRUCTION: BE SURE TO CLARIFY WHETHER RESPONDENT IS ABSOLUTELY CERTAIN THEY ARE REGISTERED OR ONLY PROBABLY REGISTERED; IF RESPONDENT VOLUNTEERS THAT THEY ARE IN NORTH DAKOTA AND DON’T HAVE TO REGISTER, PUNCH 1] {QID:reg} {new to PIAL surveys}
        1	Are you ABSOLUTELY CERTAIN that you are registered to vote at your current address, OR
        2	Are you PROBABLY registered, but there is a chance your registration has lapsed, OR
        3	Are you NOT registered to vote at your current address?
        8	(VOL.) Don't know
        9	(VOL.) Refused


In [25]:
data['reg'].value_counts()

1    2000
3     581
2     139
9      16
8      16
Name: reg, dtype: int64

##### PARTY In politics TODAY, do you consider yourself a Republican, Democrat, or independent?
        1	Republican
        2	Democrat
        3	Independent
        4	(VOL.) No preference
        5	(VOL.) Other party
        8	(VOL.) Don't know
        9	(VOL.) Refused


In [26]:
data['party'].value_counts()

3    902
2    881
1    669
4    160
9     71
8     54
5     15
Name: party, dtype: int64

# USE THIS DATA


In [3]:
import pandas as pd

In [1]:
!ls

[1m[36mNEIGHBORHOODS[m[m          [1m[36mSEGMENTATION[m[m           SegmentationData.ipynb


In [4]:
data = pd.read_csv('SEGMENTATION/2017 Pew Research Center Science and News Survey/Segmentation_data.csv')

In [12]:
data.iloc[:, :100].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 100 columns):
CaseID             4024 non-null int64
HOBBY2_1           4024 non-null object
HOBBY2_2           4024 non-null object
HOBBY2_3           4024 non-null object
GREATPAST_OE1      4024 non-null object
GREATPAST_OE2      4024 non-null object
GREATPAST_OE3      4024 non-null object
GREATFUTURE_OE1    4024 non-null object
GREATFUTURE_OE2    4024 non-null object
GREATFUTURE_OE3    4024 non-null object
FAKE_OE1           4024 non-null object
FAKE_OE2           4024 non-null object
FAKE_OE3           4024 non-null object
DISAG_OE1          4024 non-null object
DISAG_OE2          4024 non-null object
DISAG_OE3          4024 non-null object
DECIS_OE1          4024 non-null object
DECIS_OE2          4024 non-null object
DECIS_OE3          4024 non-null object
weight             4024 non-null float64
tm_start           4024 non-null object
tm_finish          4024 non-null object
duration    

In [13]:
data.iloc[:, 100:200].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 100 columns):
STORIES_c             4024 non-null int64
STORIES_d             4024 non-null int64
STORIES_e             4024 non-null int64
STORIES_f             4024 non-null int64
STORIES_g             4024 non-null int64
STORIES_h             4024 non-null int64
STORIES_Refused       4024 non-null int64
SNSUSE                4024 non-null int64
SNSFREQ               4024 non-null object
FOLLOW                4024 non-null object
FOLLOWANTI            4024 non-null object
SNSSCI                4024 non-null object
SNSCLICK              4024 non-null object
SNSSCIIMP             4024 non-null object
SNSPOST_a             4024 non-null object
SNSPOST_b             4024 non-null object
SNSPOST_c             4024 non-null object
SNSPOST_d             4024 non-null object
SNSPOST_e             4024 non-null object
SNSPOST_f             4024 non-null object
SNSPOST_g             4024 non-null obje

In [15]:
data.iloc[:, 200:500].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4024 entries, 0 to 4023
Data columns (total 24 columns):
DOV_PPC20217    4024 non-null int64
DOV_PPC21501    4024 non-null int64
DOV_PPC11518    4024 non-null int64
ppagecat        4024 non-null int64
ppagect4        4024 non-null int64
PPEDUCAT        4024 non-null int64
PPETHM          4024 non-null int64
PPGENDER        4024 non-null int64
PPHHHEAD        4024 non-null int64
PPHHSIZE        4024 non-null int64
PPHOUSE         4024 non-null int64
PPINCIMP        4024 non-null int64
PPMARIT         4024 non-null int64
PPMSACAT        4024 non-null int64
PPREG4          4024 non-null int64
ppreg9          4024 non-null int64
PPRENT          4024 non-null int64
PPWORK          4024 non-null int64
AGE             4024 non-null int64
PPT01_COL       4024 non-null int64
PPT25_COL       4024 non-null int64
PPT612_COL      4024 non-null int64
PPT1317_COL     4024 non-null int64
PPT18OV_COL     4024 non-null int64
dtypes: int64(24)
memory usag