## UFOs and preprocessing
#### Checking column types
Take a look at the UFO dataset's column types using the .info() method. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as object, and the date column, which can be transformed into the datetime type. That will make our feature engineering efforts easier later on.

In [74]:
import pandas as pd
ufo = pd.read_csv('ufo_sightings_large.TXT')
ufo

Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,long
0,11/3/2011 19:21,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,-92.291111
1,10/3/2004 19:05,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,-81.695556
2,9/25/2009 21:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.1200000,-93.287500
3,11/21/2002 05:45,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,-80.382222
4,8/19/2010 12:55,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,-114.083333
...,...,...,...,...,...,...,...,...,...,...,...
4930,7/5/2000 19:30,schnecksville,pa,us,oval,5.0,about 5 seconds,On my bike when i saw a shiny silver oval not ...,7/11/2000,40.6677778,-75.607500
4931,3/18/2008 22:00,gibson,ga,us,triangle,25.0,25 seconds,Three sided stationary object turning clockwi...,3/31/2008,33.2333333,-82.595556
4932,6/15/2005 02:30,kent,wa,us,circle,0.0,early morning,Cicle object over Washington state all differe...,10/30/2006,47.3811111,-122.233611
4933,11/1/1991 03:00,niles,mi,us,triangle,7200.0,2 hours,Triangle zigzagged. Another shined light on u...,9/2/2005,41.8297222,-86.254167


In [75]:
ufo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   date            4935 non-null   object 
 1   city            4926 non-null   object 
 2   state           4516 non-null   object 
 3   country         4255 non-null   object 
 4   type            4776 non-null   object 
 5   seconds         4935 non-null   float64
 6   length_of_time  4792 non-null   object 
 7   desc            4932 non-null   object 
 8   recorded        4935 non-null   object 
 9   lat             4935 non-null   object 
 10  long            4935 non-null   float64
dtypes: float64(2), object(9)
memory usage: 424.2+ KB


In [76]:
# Change the type of seconds to float
ufo["seconds"] = ufo['seconds'].astype('float')

# Change the date column to type datetime
ufo["date"] = pd.to_datetime(ufo['date'])

# Check the column types
print(ufo.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4935 entries, 0 to 4934
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   date            4935 non-null   datetime64[ns]
 1   city            4926 non-null   object        
 2   state           4516 non-null   object        
 3   country         4255 non-null   object        
 4   type            4776 non-null   object        
 5   seconds         4935 non-null   float64       
 6   length_of_time  4792 non-null   object        
 7   desc            4932 non-null   object        
 8   recorded        4935 non-null   object        
 9   lat             4935 non-null   object        
 10  long            4935 non-null   float64       
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 424.2+ KB
None


Nice job on transforming the column types! This will make feature engineering and standardization much easier.

#### Dropping missing data
In this exercise, you'll remove some of the rows where certain columns have missing values. You're going to look at the length_of_time column, the state column, and the type column. You'll drop any row that contains a missing value in at least one of these three columns.

In [77]:
print(ufo[['length_of_time', 'state', 'type']].shape, "\n")

# Count the missing values in the length_of_time, state, and type columns, in that order
print(ufo[['length_of_time', 'state', 'type']].isna().sum())

# Drop rows where length_of_time, state, or type are missing
ufo_no_missing = ufo.dropna(subset = ['length_of_time', 'state', 'type'])

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

(4935, 3) 

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)


Awesome! We'll work with this set going forward.

## Categorical variables and standardization
#### Extracting numbers from strings
The length_of_time field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.

In [78]:
import re

ufo['length_of_time'] = ufo['length_of_time'].astype(str) # We wrote that code for 'TypeError: expected string or bytes-like object 
                                             # ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)'
def return_minutes(time_string):

    # Search for numbers in time_string
    num = re.search('\d+', time_string)
    if num is not None:
        return int(num.group(0))
        
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(return_minutes)

# Take a look at the head of both of the columns
print(ufo[["length_of_time", "minutes"]].head())

    length_of_time  minutes
0          2 weeks      2.0
1           30sec.     30.0
2              nan      NaN
3  about 5 minutes      5.0
4                2      2.0


Nice job! The minutes information is now in a form where it can be inputted into a model.

#### Identifying features for standardization
In this exercise, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the seconds and minutes column, you'll see that the variance of the seconds column is extremely high. Because seconds and minutes are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the seconds column.

In [79]:
import numpy as np

# Check the variance of the seconds and minutes columns
print(ufo[['seconds','minutes']].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo["seconds"])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

seconds    3.156735e+10
minutes    8.425929e+02
dtype: float64
nan


  result = getattr(ufunc, method)(*inputs, **kwargs)


Good work! Now it's time to engineer new features in the ufo dataset.

## Engineering new features
#### Encoding categorical variables
There are couple of columns in the UFO dataset that need to be encoded before they can be modeled through scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.

In [80]:
# Use pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda a: 1 if a == 'us' else 0)
ufo["country_enc"]

0       1
1       1
2       1
3       1
4       0
       ..
4930    1
4931    1
4932    1
4933    1
4934    1
Name: country_enc, Length: 4935, dtype: int64

In [81]:
ufo["country"].unique()

array(['us', 'ca', nan, 'au', 'gb', 'de'], dtype=object)

In [82]:
print(ufo["country"].unique())

# Use pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda a: 1 if a == 'us' else 0 )

# Print the number of unique type values
print(len(ufo['type'].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

ufo.head()

['us' 'ca' nan 'au' 'gb' 'de']
22


Unnamed: 0,date,city,state,country,type,seconds,length_of_time,desc,recorded,lat,...,flash,formation,light,other,oval,rectangle,sphere,teardrop,triangle,unknown
0,2011-11-03 19:21:00,woodville,wi,us,unknown,1209600.0,2 weeks,Red blinking objects similar to airplanes or s...,12/12/2011,44.9530556,...,0,0,0,0,0,0,0,0,0,1
1,2004-10-03 19:05:00,cleveland,oh,us,circle,30.0,30sec.,Many fighter jets flying towards UFO,10/27/2004,41.4994444,...,0,0,0,0,0,0,0,0,0,0
2,2009-09-25 21:00:00,coon rapids,mn,us,cigar,0.0,,Green&#44 red&#44 and blue pulses of light tha...,12/12/2009,45.12,...,0,0,0,0,0,0,0,0,0,0
3,2002-11-21 05:45:00,clemmons,nc,us,triangle,300.0,about 5 minutes,It was a large&#44 triangular shaped flying ob...,12/23/2002,36.0213889,...,0,0,0,0,0,0,0,0,1,0
4,2010-08-19 12:55:00,calgary (canada),ab,ca,oval,0.0,2,A white spinning disc in the shape of an oval.,8/24/2010,51.083333,...,0,0,0,0,1,0,0,0,0,0


Awesome work! Let's continue on by extracting date components.

#### Features from dates
Another feature engineering task to perform is month and year extraction. Perform this task on the date column of the ufo dataset.

In [83]:
# Look at the first 5 rows of the date column
print(ufo['date'].head(),"\n\n")

# Extract the month from the date column
ufo["month"] = ufo["date"].dt.month

# Extract the year from the date column
ufo["year"] = ufo["date"].dt.year

# Take a look at the head of all three columns
print(ufo[["date","month","year"]].head())

0   2011-11-03 19:21:00
1   2004-10-03 19:05:00
2   2009-09-25 21:00:00
3   2002-11-21 05:45:00
4   2010-08-19 12:55:00
Name: date, dtype: datetime64[ns] 


                 date  month  year
0 2011-11-03 19:21:00     11  2011
1 2004-10-03 19:05:00     10  2004
2 2009-09-25 21:00:00      9  2009
3 2002-11-21 05:45:00     11  2002
4 2010-08-19 12:55:00      8  2010


Nice job on extracting dates! The pandas series attributes .dt.month and .dt.year are extremely useful for extraction tasks.

#### Text vectorization
You'll now transform the desc column in the UFO dataset into tf/idf vectors, since there's likely something we can learn from this field.

In [84]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Take a look at the head of the desc field
print(ufo['desc'].head(),"\n\n")

# Instantiate the tfidf vectorizer object
vec = TfidfVectorizer()

ufo.dropna(subset=['desc'], inplace=True)# If we didn't do this, it would give 
                                         # ValueError: np.nan is an invalid document, expected byte or unicode string. 
                                         # ---> desc_tfidf = vec.fit_transform(ufo['desc'])

# Fit and transform desc using vec
desc_tfidf = vec.fit_transform(ufo['desc'])

# Look at the number of columns and rows
print(desc_tfidf)

0    Red blinking objects similar to airplanes or s...
1                 Many fighter jets flying towards UFO
2    Green&#44 red&#44 and blue pulses of light tha...
3    It was a large&#44 triangular shaped flying ob...
4       A white spinning disc in the shape of an oval.
Name: desc, dtype: object 


  (0, 599)	0.3196929518112806
  (0, 6204)	0.32725095835678897
  (0, 5937)	0.16053126945167856
  (0, 5223)	0.10691429469459467
  (0, 3091)	0.08229067463398756
  (0, 757)	0.1948218553551395
  (0, 3964)	0.3652351192583987
  (0, 3861)	0.23436398000967731
  (0, 4027)	0.19655659663513286
  (0, 2023)	0.3133027483289482
  (0, 5691)	0.1478328149366558
  (0, 5433)	0.21336485559773063
  (0, 4174)	0.18489753495978478
  (0, 618)	0.2985170498722688
  (0, 5771)	0.1137690531662858
  (0, 5185)	0.2876409315293689
  (0, 4080)	0.16409775986675704
  (0, 1074)	0.22710102287538758
  (0, 4715)	0.13709997670868124
  (1, 5946)	0.27013857869066665
  (1, 5802)	0.4048104966318018
  (1, 2536)	0.27872791703003025
  (1

Great! You'll notice that the text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.

## Feature selection and modeling

#### Selecting the ideal dataset
Now to get rid of some of the unnecessary features in the ufo dataset. Because the country column has been encoded as country_enc, you can select it and drop the other columns related to location: city, country, lat, long, and state.

You've engineered the month and year columns, so you no longer need the date or recorded columns. You also standardized the seconds column as seconds_log, so you can drop seconds and minutes.

You vectorized desc, so it can be removed. For now you'll keep type.

You can also get rid of the length_of_time column, which is unnecessary after extracting minutes.

In [85]:
vocab = {1048: 'web',
 278: 'designer',
 1017: 'urban',
 38: 'adventures',
 490: 'ice',
 890: 'skating',
 90: 'at',
 559: 'lasker',
 832: 'rink',
 368: 'fight',
 423: 'global',
 487: 'hunger',
 68: 'and',
 944: 'support',
 1061: 'women',
 356: 'farmers',
 535: 'join',
 969: 'the',
 708: 'oxfam',
 27: 'action',
 240: 'corps',
 498: 'in',
 680: 'nyc',
 922: 'stop',
 947: 'swap',
 790: 'queens',
 911: 'staff',
 281: 'development',
 992: 'trainer',
 200: 'claro',
 145: 'brooklyn',
 1037: 'volunteer',
 93: 'attorney',
 221: 'community',
 455: 'health',
 43: 'advocates',
 942: 'supervise',
 189: 'children',
 466: 'highland',
 717: 'park',
 409: 'garden',
 1071: 'worldofmoney',
 696: 'org',
 1085: 'youth',
 60: 'amazing',
 791: 'race',
 789: 'qualified',
 133: 'board',
 620: 'member',
 860: 'seats',
 98: 'available',
 1083: 'young',
 33: 'adult',
 1006: 'tutor',
 1016: 'updated',
 11: '30',
 0: '11',
 513: 'insurance',
 199: 'claims',
 600: 'manager',
 979: 'timebanksnyc',
 432: 'great',
 340: 'exchange',
 205: 'clean',
 1015: 'up',
 81: 'asbury',
 171: 'cementary',
 918: 'staten',
 524: 'island',
 869: 'senior',
 194: 'citizen',
 392: 'friendly',
 1033: 'visitor',
 881: 'shop',
 1000: 'tree',
 161: 'care',
 1068: 'workshop',
 4: '20',
 646: 'movie',
 856: 'screener',
 380: 'for',
 870: 'seniors',
 355: 'farm',
 430: 'graphic',
 691: 'open',
 480: 'house',
 416: 'get',
 984: 'tools',
 980: 'to',
 806: 'recycling',
 1039: 'volunteers',
 660: 'needed',
 353: 'family',
 336: 'event',
 207: 'clerical',
 158: 'cancer',
 1041: 'walk',
 120: 'befitnyc',
 739: 'physical',
 30: 'activity',
 700: 'organizers',
 269: 'decision',
 266: 'day',
 5: '2011',
 661: 'needs',
 1084: 'your',
 459: 'help',
 405: 'gain',
 1021: 'valuable',
 245: 'counseling',
 344: 'experience',
 687: 'on',
 845: 'samaritans',
 9: '24',
 479: 'hour',
 255: 'crisis',
 478: 'hotline',
 457: 'heart',
 407: 'gallery',
 703: 'our',
 503: 'info',
 949: 'table',
 373: 'finding',
 471: 'homes',
 542: 'kids',
 1077: 'yiddish',
 903: 'speaking',
 472: 'homework',
 460: 'helper',
 892: 'skilled',
 800: 'rebuilding',
 982: 'together',
 468: 'home',
 818: 'repairs',
 438: 'greenteam',
 40: 'advetures',
 940: 'summer',
 931: 'streets',
 1005: 'tuesday',
 335: 'evenings',
 1060: 'with',
 612: 'masa',
 594: 'lunch',
 770: 'program',
 1018: 'us',
 706: 'outreach',
 618: 'meals',
 760: 'preparedness',
 222: 'compost',
 773: 'project',
 613: 'master',
 223: 'composter',
 178: 'certificate',
 249: 'course',
 318: 'emblemhealth',
 144: 'bronx',
 683: 'of',
 873: 'service',
 531: 'jcc',
 601: 'manhattan',
 418: 'girl',
 855: 'scout',
 872: 'series',
 296: 'dorot',
 838: 'rosh',
 452: 'hashanah',
 709: 'package',
 274: 'delivery',
 713: 'painting',
 511: 'instructor',
 530: 'jasa',
 464: 'hes',
 172: 'center',
 12: '3rd',
 70: 'annual',
 377: 'flyny',
 548: 'kite',
 366: 'festival',
 983: 'tomorrow',
 151: 'business',
 566: 'leaders',
 955: 'teach',
 110: 'basics',
 465: 'high',
 852: 'schoolers',
 410: 'gardening',
 397: 'ft',
 1004: 'tryon',
 910: 'st',
 610: 'martin',
 748: 'poetry',
 668: 'new',
 1079: 'york',
 216: 'college',
 424: 'goal',
 941: 'sunday',
 361: 'february',
 6: '2012',
 262: 'dance',
 8: '22nd',
 560: 'latino',
 604: 'march',
 2: '17',
 1013: 'university',
 848: 'saturday',
 1008: 'tutors',
 744: 'planet',
 485: 'human',
 602: 'mapping',
 420: 'give',
 1050: 'week',
 186: 'child',
 569: 'learn',
 796: 'read',
 926: 'storytelling',
 243: 'costume',
 597: 'making',
 912: 'stage',
 277: 'design',
 319: 'emergency',
 351: 'fair',
 17: '9th',
 1053: 'west',
 887: 'side',
 248: 'county',
 676: 'nutrition',
 314: 'educator',
 879: 'shape',
 306: 'east',
 13: '54st',
 801: 'rec',
 1046: 'water',
 45: 'aerobics',
 83: 'asser',
 573: 'levy',
 712: 'paint',
 57: 'alongside',
 783: 'publicolor',
 936: 'students',
 536: 'jumpstart',
 797: 'readers',
 564: 'lead',
 252: 'crafts',
 408: 'games',
 348: 'face',
 751: 'popcorn',
 527: 'jackie',
 835: 'robinson',
 716: 'parent',
 375: 'fitness',
 916: 'starrett',
 197: 'city',
 585: 'line',
 263: 'dancer',
 615: 'math',
 587: 'literacy',
 114: 'be',
 209: 'climb',
 985: 'top',
 608: 'marketing',
 86: 'assistant',
 313: 'education',
 673: 'nonprofit',
 867: 'seeks',
 805: 'recruitment',
 626: 'mentors',
 810: 'register',
 92: 'attend',
 142: 'breakfast',
 701: 'orientation',
 529: 'january',
 272: 'deliver',
 1058: 'winter',
 1031: 'visit',
 65: 'an',
 525: 'isolated',
 342: 'exercise',
 213: 'coach',
 670: 'night',
 115: 'beach',
 180: 'change',
 77: 'art',
 772: 'programs',
 229: 'consumer',
 779: 'protection',
 562: 'law',
 589: 'liver',
 579: 'life',
 565: 'leader',
 901: 'soup',
 547: 'kitchen',
 307: 'eastern',
 534: 'john',
 650: 'muir',
 930: 'street',
 1024: 'vendor',
 641: 'monthly',
 959: 'team',
 367: 'fiesta',
 977: 'throgs',
 658: 'neck',
 224: 'computer',
 956: 'teacher',
 567: 'leadership',
 244: 'council',
 693: 'opportunity',
 231: 'conversation',
 461: 'helpers',
 427: 'grades',
 714: 'pantry',
 288: 'distribution',
 305: 'earth',
 960: 'tech',
 1049: 'website',
 692: 'opportunities',
 175: 'cents',
 19: 'ability',
 203: 'classroom',
 877: 'set',
 146: 'brush',
 545: 'kindness',
 999: 'transportation',
 58: 'alternatives',
 129: 'bike',
 1020: 'valet',
 1026: 'video',
 311: 'editing',
 767: 'professionals',
 921: 'stipend',
 49: 'after',
 851: 'school',
 624: 'mentor',
 666: 'networking',
 138: 'bowling',
 398: 'fun',
 449: 'harlem',
 555: 'lanes',
 866: 'seeking',
 1078: 'yoga',
 902: 'spanish',
 695: 'or',
 389: 'french',
 362: 'feed',
 488: 'hungry',
 1080: 'yorkers',
 14: '55',
 690: 'only',
 735: 'phone',
 106: 'bank',
 819: 'representative',
 795: 'reach',
 704: 'out',
 643: 'morris',
 458: 'heights',
 904: 'special',
 155: 'camp',
 946: 'susan',
 551: 'komen',
 259: 'cure',
 433: 'greater',
 47: 'affiliate',
 303: 'dumbo',
 79: 'arts',
 698: 'organizational',
 148: 'budget',
 639: 'money',
 596: 'makes',
 871: 'sense',
 994: 'training',
 889: 'site',
 1027: 'videographer',
 376: 'fly',
 152: 'by',
 970: 'theater',
 429: 'grant',
 1074: 'writer',
 745: 'planning',
 778: 'proposal',
 759: 'preparation',
 399: 'fund',
 793: 'raising',
 450: 'harm',
 808: 'reduction',
 35: 'adv',
 515: 'intern',
 875: 'serving',
 575: 'lgbt',
 34: 'adults',
 482: 'how',
 830: 'ride',
 130: 'bikes',
 821: 'research',
 401: 'fundraising',
 280: 'developement',
 233: 'cook',
 840: 'row',
 50: 'afterschool',
 630: 'middle',
 885: 'shower',
 400: 'fundraisers',
 526: 'it',
 519: 'interpreters',
 563: 'lawyers',
 446: 'haitian',
 18: 'abe',
 757: 'pre',
 412: 'ged',
 640: 'monitor',
 89: 'astoria',
 634: 'million',
 1001: 'trees',
 421: 'giveaway',
 290: 'do',
 1081: 'you',
 1044: 'want',
 595: 'make',
 283: 'difference',
 204: 'classwish',
 896: 'snow',
 883: 'shoveling',
 196: 'citizenship',
 761: 'press',
 586: 'list',
 781: 'public',
 813: 'relations',
 743: 'plan',
 829: 'review',
 394: 'friendship',
 753: 'positive',
 121: 'beginnings',
 546: 'kit',
 611: 'mary',
 803: 'recreation',
 291: 'does',
 697: 'organization',
 659: 'need',
 858: 'search',
 928: 'strategy',
 332: 'esl',
 46: 'affected',
 924: 'storm',
 995: 'transform',
 590: 'lives',
 933: 'strengthen',
 220: 'communities',
 119: 'become',
 302: 'driver',
 1025: 'veterans',
 191: 'chinese',
 997: 'translator',
 512: 'instructors',
 653: 'museum',
 621: 'membership',
 275: 'department',
 284: 'director',
 117: 'beautify',
 996: 'transitional',
 822: 'residence',
 470: 'homeless',
 623: 'men',
 953: 'tank',
 517: 'internship',
 774: 'projects',
 841: 'run',
 1056: 'wild',
 139: 'boys',
 475: 'hope',
 419: 'girls',
 219: 'communications',
 792: 'raise',
 100: 'awareness',
 31: 'administrative',
 56: 'alliance',
 811: 'registrar',
 647: 'ms',
 1062: 'word',
 162: 'career',
 246: 'counselor',
 722: 'passover',
 304: 'early',
 188: 'childhood',
 149: 'build',
 747: 'plastic',
 137: 'bottle',
 857: 'sculpture',
 763: 'pride',
 523: 'is',
 538: 'just',
 76: 'around',
 238: 'corner',
 520: 'involved',
 675: 'now',
 390: 'fresh',
 53: 'air',
 957: 'teachers',
 372: 'find',
 729: 'perfect',
 533: 'job',
 684: 'office',
 1075: 'writing',
 264: 'data',
 326: 'entry',
 29: 'activism',
 738: 'photography',
 843: 'salesforce',
 265: 'database',
 261: 'customization',
 736: 'photo',
 333: 'essay',
 572: 'legal',
 42: 'advisor',
 467: 'hike',
 974: 'thon',
 236: 'coordinator',
 558: 'laser',
 950: 'tag',
 298: 'dowling',
 3: '175th',
 505: 'information',
 962: 'technology',
 352: 'fall',
 382: 'forest',
 826: 'restoration',
 541: 'kickoff',
 1002: 'trevor',
 582: 'lifeline',
 247: 'counselors',
 973: 'thomas',
 532: 'jefferson',
 614: 'materials',
 1076: 'year',
 386: 'founder',
 341: 'executive',
 453: 'haunted',
 557: 'lantern',
 989: 'tours',
 383: 'fort',
 986: 'totten',
 657: 'national',
 878: 'sexual',
 82: 'assault',
 689: 'online',
 993: 'trainers',
 48: 'african',
 63: 'american',
 210: 'clothing',
 301: 'drive',
 828: 'returning',
 865: 'seeds',
 939: 'success',
 746: 'plant',
 981: 'today',
 443: 'growth',
 1009: 'udec',
 328: 'enviromedia',
 636: 'mobile',
 606: 'maritime',
 102: 'bacchanal',
 742: 'pirates',
 365: 'fest',
 492: 'ikea',
 329: 'erie',
 111: 'basin',
 282: 'diabetes',
 88: 'association',
 364: 'feria',
 267: 'de',
 844: 'salud',
 664: 'nepali',
 105: 'bangla',
 784: 'punjabi',
 998: 'translators',
 674: 'not',
 769: 'profit',
 741: 'pioneer',
 159: 'capoeira',
 1023: 'various',
 752: 'positions',
 287: 'dispatcher',
 991: 'trainee',
 506: 'ing',
 603: 'marathon',
 388: 'free',
 593: 'love',
 135: 'books',
 268: 'dear',
 96: 'authors',
 52: 'aide',
 850: 'scheuer',
 627: 'merchandise',
 293: 'donate',
 943: 'supplies',
 360: 'feast',
 406: 'gala',
 112: 'battery',
 833: 'rise',
 919: 'stay',
 787: 'put',
 820: 'rescue',
 897: 'soccer',
 402: 'futsal',
 730: 'performing',
 36: 'advanced',
 202: 'classes',
 1070: 'world',
 854: 'science',
 1054: 'western',
 64: 'americorps',
 25: 'aces',
 310: 'economic',
 864: 'security',
 507: 'initiative',
 331: 'esi',
 633: 'mill',
 173: 'centers',
 631: 'midtown',
 1088: 'zumba',
 1030: 'vision',
 635: 'mission',
 66: 'analysis',
 552: 'lab',
 958: 'teaching',
 84: 'assist',
 827: 'resume',
 150: 'building',
 899: 'society',
 214: 'coaches',
 1040: 'vs',
 218: 'committee',
 842: 'russian',
 385: 'foster',
 170: 'celebration',
 616: 'may',
 7: '21th',
 688: 'one',
 711: 'pager',
 294: 'donation',
 489: 'hurricane',
 521: 'irene',
 354: 'far',
 836: 'rockaway',
 325: 'enjoy',
 1066: 'working',
 686: 'olympics',
 988: 'tournament',
 798: 'reading',
 719: 'partners',
 234: 'cooper',
 909: 'square',
 975: 'thrift',
 908: 'spring',
 166: 'case',
 599: 'management',
 404: 'fvcp',
 990: 'trail',
 254: 'crew',
 447: 'halloween',
 165: 'carnival',
 1042: 'walkathon',
 359: 'feasibility',
 67: 'analyst',
 749: 'police',
 868: 'seminar',
 1064: 'work',
 1035: 'visually',
 496: 'impaired',
 964: 'teens',
 972: 'this',
 322: 'energy',
 315: 'efficiency',
 321: 'end',
 859: 'season',
 156: 'campaign',
 123: 'benefits',
 802: 'reception',
 300: 'drill',
 237: 'copywriting',
 235: 'coord',
 454: 'have',
 725: 'penchant',
 55: 'all',
 971: 'things',
 1028: 'vintage',
 976: 'thriftshop',
 718: 'partner',
 726: 'pencil',
 720: 'partnership',
 710: 'packing',
 16: '8th',
 907: 'sports',
 346: 'expo',
 164: 'cares',
 184: 'cheerleaders',
 1045: 'wanted',
 445: 'habitat',
 371: 'finance',
 215: 'coffee',
 324: 'english',
 755: 'practice',
 570: 'learners',
 456: 'healthy',
 28: 'active',
 978: 'time',
 122: 'benefit',
 73: 'april',
 357: 'fashion',
 929: 'strawberry',
 87: 'assistants',
 174: 'central',
 1087: 'zoo',
 1: '125th',
 127: 'bideawee',
 440: 'greeters',
 592: 'looking',
 799: 'real',
 495: 'impact',
 504: 'inform',
 728: 'people',
 756: 'practices',
 580: 'lifebeat',
 413: 'general',
 932: 'streetsquash',
 286: 'discovery',
 874: 'services',
 663: 'neighborhood',
 768: 'profiles',
 951: 'take',
 915: 'stand',
 51: 'against',
 1029: 'violence',
 345: 'expert',
 41: 'advice',
 537: 'june',
 849: 'schedule',
 258: 'crowdfunding',
 727: 'penny',
 451: 'harvest',
 434: 'green',
 185: 'chefs',
 677: 'nutritionists',
 379: 'foodies',
 625: 'mentoring',
 136: 'boom',
 669: 'newsletter',
 217: 'come',
 934: 'strides',
 1043: 'walks',
 187: 'childcare',
 898: 'social',
 619: 'media',
 422: 'giving',
 157: 'can',
 61: 'ambassador',
 10: '2nd',
 967: 'thanksgiving',
 363: 'feeding',
 662: 'needy',
 782: 'publicity',
 723: 'patient',
 163: 'caregiver',
 1032: 'visiting',
 469: 'homebound',
 358: 'fc',
 679: 'nyawc',
 384: 'forum',
 21: 'about',
 1038: 'volunteering',
 809: 'refreshments',
 847: 'sara',
 837: 'roosevelt',
 206: 'cleanup',
 116: 'beautification',
 337: 'events',
 69: 'animal',
 484: 'hudson',
 834: 'river',
 605: 'mariners',
 825: 'response',
 343: 'exhibit',
 20: 'aboard',
 584: 'lilac',
 208: 'client',
 1052: 'welcome',
 279: 'desk',
 685: 'older',
 574: 'lexington',
 251: 'craft',
 750: 'poll',
 1065: 'workers',
 518: 'interperters',
 24: 'accounting',
 85: 'assistance',
 477: 'hosting',
 776: 'promotion',
 1011: 'unicef',
 954: 'tap',
 814: 'release',
 270: 'dedication',
 771: 'programming',
 500: 'incarnation',
 295: 'donor',
 544: 'kieran',
 906: 'sponsorship',
 1069: 'workshops',
 118: 'because',
 338: 'every',
 276: 'deserves',
 179: 'chance',
 740: 'pin',
 273: 'delivered',
 886: 'shred',
 15: '5th',
 99: 'avenue',
 169: 'cdsc',
 917: 'starving',
 78: 'artist',
 884: 'show',
 948: 'system',
 396: 'front',
 880: 'share',
 553: 'lanch',
 935: 'student',
 463: 'hemophilia',
 577: 'liason',
 629: 'methodist',
 476: 'hospital',
 113: 'bay',
 831: 'ridge',
 124: 'benonhurst',
 75: 'area',
 900: 'sought',
 97: 'autistic',
 297: 'douglaston',
 788: 'qns',
 812: 'registration',
 32: 'administrator',
 153: 'call',
 426: 'governor',
 804: 'recruiter',
 786: 'purim',
 327: 'envelope',
 938: 'stuffing',
 528: 'jam',
 462: 'helpline',
 923: 'store',
 374: 'first',
 415: 'generation',
 1022: 'van',
 241: 'cortlandt',
 816: 'remembrance',
 945: 'survey',
 823: 'resonations',
 143: 'breast',
 323: 'engine',
 694: 'optimization',
 622: 'memorial',
 894: 'sloan',
 540: 'kettering',
 435: 'greenhouse',
 436: 'greening',
 227: 'concert',
 334: 'evacuation',
 824: 'resources',
 417: 'gift',
 126: 'bicycling',
 656: 'my',
 393: 'friends',
 473: 'honor',
 1051: 'weekend',
 731: 'person',
 651: 'mural',
 312: 'editor',
 732: 'personal',
 882: 'shopper',
 764: 'pro',
 134: 'bono',
 253: 'create',
 160: 'cards',
 920: 'step',
 672: 'non',
 780: 'provider',
 516: 'interns',
 645: 'motion',
 431: 'graphics',
 125: 'best',
 147: 'buddies',
 502: 'inern',
 103: 'back',
 588: 'little',
 242: 'cosmetologist',
 107: 'barber',
 1036: 'vocational',
 72: 'apartment',
 439: 'greeter',
 766: 'professional',
 1019: 'use',
 893: 'skills',
 702: 'others',
 369: 'figure',
 257: 'croton',
 190: 'chinatown',
 193: 'ci',
 758: 'prep',
 239: 'corporate',
 1063: 'wordpress',
 132: 'blog',
 510: 'instructer',
 807: 'red',
 474: 'hook',
 289: 'divert',
 966: 'textiles',
 395: 'from',
 554: 'landfill',
 437: 'greenmarket',
 965: 'textile',
 154: 'calling',
 195: 'citizens',
 497: 'improve',
 26: 'achievement',
 721: 'passion',
 481: 'housing',
 1067: 'works',
 499: 'inc',
 441: 'group',
 299: 'drama',
 561: 'laundromats',
 320: 'employment',
 927: 'strategic',
 667: 'never',
 104: 'bad',
 391: 'friend',
 403: 'future',
 201: 'class',
 1059: 'wish',
 387: 'fpcj',
 1072: 'worship',
 1010: 'undergraduate',
 428: 'graduate',
 228: 'conference',
 1047: 'we',
 775: 'promote',
 550: 'knowledge',
 715: 'parade',
 74: 'archivist',
 425: 'google',
 44: 'adwords',
 493: 'imentor',
 642: 'more',
 598: 'male',
 632: 'miles',
 637: 'moms',
 183: 'charity',
 176: 'century',
 987: 'tour',
 198: 'civil',
 724: 'patrol',
 62: 'america',
 539: 'kept',
 862: 'secret',
 648: 'ms131',
 549: 'knitter',
 256: 'crochet',
 131: 'blankets',
 177: 'ceo',
 591: 'logo',
 1012: 'unique',
 1057: 'will',
 128: 'big',
 37: 'adventure',
 23: 'accountant',
 876: 'session',
 888: 'single',
 644: 'mothers',
 192: 'choice',
 895: 'smc',
 1055: 'wii',
 705: 'outdoor',
 671: 'nights',
 607: 'market',
 514: 'intake',
 638: 'monday',
 141: 'branding',
 140: 'brand',
 491: 'identity',
 649: 'mt',
 1086: 'zion',
 543: 'kidz',
 817: 'reorganize',
 578: 'library',
 378: 'food',
 91: 'athletic',
 568: 'league',
 655: 'musician',
 59: 'alzheimer',
 654: 'music',
 109: 'bash',
 765: 'proctor',
 952: 'taking',
 339: 'exams',
 777: 'promotional',
 }

In [86]:
# Add in the rest of the arguments
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    
    # Transform that zipped dict into a series
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    
    # Sort the series to pull out the top n weighted words
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]


def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    for i in range(0, vector.shape[0]):
    
        # Call the return_weights function and extend filter_list
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    # Return the list in a set, so we don't get duplicate word indices
    return set(filter_list)

# Make a list of features to drop   
to_drop = ["city", "country", "date", "desc", "lat", "length_of_time", "long", "minutes", "recorded", "seconds", "state"]

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

KeyError: 6204

Great job! You're almost done. In the next exercises, you'll model the UFO data in a couple of different ways.

In [91]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X = ufo.loc[:,['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone', 'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash', 'formation', 'light', 'other', 'oval', 'rectangle',
           #'sphere', 'teardrop', 'triangle', 'unknown', 'month', 'year']]
    
X = ufo.drop(['date', 'city', 'state', 'country', 'type', 'seconds', 'length_of_time',
       'desc', 'recorded', 'lat', 'long', 'minutes', 'seconds_log',
       'country_enc'], axis = 1)

y = ufo.loc[:,["country_enc"]]

# Split the X and y sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 42)

knn = KNeighborsClassifier()

# Fit knn to the training sets
knn.fit(X_train, y_train)

# Print the score of knn on the test sets
print(knn.score(X_test, y_test))

  return self._fit(X, y)


0.7591240875912408


Awesome work! This model performs pretty well. It seems like you've made pretty good feature selection choices here.

#### Modeling the UFO dataset, part 2
Finally, you'll build a model using the text vector we created, desc_tfidf, using the filtered_words list to create a filtered text vector. Let's see if you can predict the type of the sighting based on the text. You'll use a Naive Bayes model for this.

In [93]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
X_train, X_test, y_train, y_test = train_test_split(filtered_text.toarray(), y, stratify=y, random_state=42)

# Fit nb to the training sets
nb.fit(X_train, y_train)

# Print the score of nb on the test sets
print(nb.score(X_test, y_test))

NameError: name 'filtered_words' is not defined

<script.py> output:
    0.17987152034261242

Congrats, you've completed the course! As you can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and if perhaps any of the other features are useful in predicting type.