# Case Study

## Part 1

1. Load in the data in `companies.csv` and `prices.csv` (in the data folder).

In [1]:
import pandas as pd

In [2]:
companies = pd.read_csv('../data/companies.csv')
prices = pd.read_csv('../data/prices.csv')

2. Write a function `is_incorporated` that checks whether an input string, `name`, contains the substring "inc" or "Inc". Its definition should look like this:
```python
def is_incorporated(name):
```
(Yes, all these companies are *technically* incorporated, but bear with us for the exercise.)
<br>
*Hint: you may want to google something like "check if one string is within another Python"*

In [3]:
def is_incorporated(name):
    '''Checks if a company name contains some variant of "inc"'''
    # The 'in' operator checks if one string is within another.
    if 'inc' in name:
        return True
    elif 'Inc' in name:
        return True
    else:
        return False

3. Test this function to be sure it works. Try passing in some strings that contain the substring and some that don't. Test it on data from the companies DataFrame.

In [4]:
companies.head()

Unnamed: 0,Symbol,Name,Sector
0,MMM,3M Company,Industrials
1,AOS,A.O. Smith Corp,Industrials
2,ABT,Abbott Laboratories,Health Care
3,ABBV,AbbVie Inc.,Health Care
4,ACN,Accenture plc,Information Technology


In [5]:
is_incorporated('3M Company')

False

In [6]:
is_incorporated('AbbVie Inc.')

True

In [7]:
is_incorporated('Accenture plc')

False
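The spot checks above can also be written as assertions, so a failure is loud rather than something you have to notice by eye. This is a minimal sketch: the function body is a condensed but equivalent version of the one above, and the last two company names are hypothetical inputs chosen to show the limits of a plain substring check.

```python
def is_incorporated(name):
    '''Checks if a company name contains "inc" or "Inc".'''
    return 'inc' in name or 'Inc' in name

# Names taken from companies.head() above
assert is_incorporated('AbbVie Inc.') is True
assert is_incorporated('3M Company') is False
assert is_incorporated('Accenture plc') is False

# Hypothetical inputs that expose edge cases of substring matching
assert is_incorporated('Princeton Widgets') is True   # "inc" appears inside "Princeton"
assert is_incorporated('ACME INC') is False           # all-caps "INC" is not matched
```

If every assertion passes, the cell prints nothing.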

4. Write a `for` loop to iterate through the elements in the Name column of the companies data, applying `is_incorporated` to each element and printing the result.

In [8]:
# Loop through elements in the "Name" column
for company_name in companies['Name']:
    result = is_incorporated(company_name)
    print(result)

False
False
False
True
False
False
True
True
False
True
False
True
True
True
True
True
True
True
False
True
False
False
False
False
False
False
False
True
True
True
True
False
False
False
False
True
False
True
False
False
True
True
False
False
True
False
False
True
False
False
False
True
True
False
False
True
False
True
True
True
False
True
True
False
False
False
False
True
False
False
False
True
True
False
False
False
True
False
False
False
True
False
False
False
False
True
False
False
False
False
True
True
False
True
False
False
False
False
False
False
True
False
True
False
False
False
False
False
False
False
False
True
False
False
True
False
False
True
False
False
False
False
False
True
False
False
False
False
False
True
False
True
False
True
False
True
False
False
False
False
True
False
True
False
False
True
False
True
True
False
False
False
False
False
False
False
False
False
False
False
False
False
False
True
True
False
False
False
False
False
False
False
False
True
False
False
True
...

5. Now rewrite the code for #4 using the `Series.apply` method -- apply the function to the Series and print the resulting Series.

In [9]:
# Much simpler!
companies['Name'].apply(is_incorporated)

0      False
1      False
2      False
3       True
4      False
       ...  
500     True
501     True
502    False
503    False
504    False
Name: Name, Length: 505, dtype: bool
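For simple substring checks like this one, pandas also offers the vectorized `Series.str.contains` method, which avoids writing a helper function at all. The sketch below uses a small toy Series in place of `companies['Name']`; the first three names come from the data above and the last one is hypothetical.

```python
import pandas as pd

# Toy stand-in for companies['Name']
names = pd.Series(['3M Company', 'AbbVie Inc.', 'Accenture plc', 'Princeton Widgets'])

# The pattern is a regular expression: match "inc" or "Inc" anywhere in the name
mask = names.str.contains('inc|Inc')
print(mask.tolist())  # [False, True, False, True]
```

`Series.apply` remains the more general tool, since it works with any Python function rather than only string patterns.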

6. *Similar, but less guided.* Create a new column, `name_length`, whose value is:
    - `"long"` if the company name is more than 12 characters
    - `"medium"` if the company name is 8-12 characters
    - `"short"` if the company name is 7 or fewer characters.

In [10]:
def get_name_length(name):
    '''Determines if a company name is short, medium, or long.'''
    length = len(name)
    if length > 12:
        return 'long'
    elif 8 <= length <= 12:
        return 'medium'
    else:  # Name has 7 or fewer characters
        return 'short'

In [11]:
# Add a new column using this function
companies['name_length'] = companies['Name'].apply(get_name_length)

In [12]:
companies.head()

Unnamed: 0,Symbol,Name,Sector,name_length
0,MMM,3M Company,Industrials,medium
1,AOS,A.O. Smith Corp,Industrials,long
2,ABT,Abbott Laboratories,Health Care,long
3,ABBV,AbbVie Inc.,Health Care,medium
4,ACN,Accenture plc,Information Technology,long
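As an alternative sketch, the same binning can be done without a custom function by combining `Series.str.len` with `pd.cut`. The thresholds below assume the cutoffs used in the function above (7 or fewer is short, 8 through 12 is medium, more than 12 is long). The toy Series stands in for `companies['Name']`, and "Aflac" is a hypothetical short name added for illustration.

```python
import pandas as pd

# Toy stand-in for companies['Name']
names = pd.Series(['3M Company', 'A.O. Smith Corp', 'AbbVie Inc.', 'Accenture plc', 'Aflac'])
lengths = names.str.len()

# Bins are right-inclusive by default: (0, 7] short, (7, 12] medium, (12, inf] long
name_length = pd.cut(
    lengths,
    bins=[0, 7, 12, float('inf')],
    labels=['short', 'medium', 'long'],
)
print(name_length.tolist())  # ['medium', 'long', 'medium', 'long', 'short']
```

Note that `pd.cut` returns a categorical column, and that an empty name (length 0) would fall outside the first bin and come back as missing.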


7. Write a function `make_colname_string` that takes a DataFrame as an argument and returns a string that contains all the DataFrame's columns' names, comma separated. For example, running `make_colname_string` on our companies data would look like this:<br><br>
```python
make_colname_string(companies)
#> 'Symbol,Name,Sector'
```
<br>*Hint: a `for` loop will be helpful.*
<br>Test it on the prices data. What does it return?

In [13]:
def make_colname_string(df):
    '''Concatenate a DataFrame's column names.'''
    # Start with a blank string that we can add to incrementally.
    result_str = ''
    for colname in df.columns:
        # This is tricky! Only add a comma separator if this is not the first column name
        if len(result_str) == 0:
            result_str = colname
        else:
            result_str = result_str + ',' + colname
    return result_str

In [14]:
make_colname_string(prices)

'Symbol,Price,Quarter'
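Once the loop version makes sense, it is worth knowing that Python's built-in `str.join` handles the "no comma before the first element" problem for you. The sketch below is an alternative to the loop, not a replacement for the exercise; the toy DataFrame reuses the column names of the prices data with made-up values.

```python
import pandas as pd

def make_colname_string(df):
    '''Concatenate a DataFrame's column names, comma separated.'''
    # map(str, ...) guards against non-string column labels such as integers
    return ','.join(map(str, df.columns))

# Toy frame with the same column names as prices (values are made up)
toy_prices = pd.DataFrame({'Symbol': ['MMM'], 'Price': [100.0], 'Quarter': [4]})
print(make_colname_string(toy_prices))  # Symbol,Price,Quarter
```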

### Putting It All Together
Suppose you've discovered a great secret about the stock market: companies with long names (as defined above) are going to double in value after quarter 4 (our most recent data), companies with short names are going to halve in value, and medium-name companies will stay exactly the same. Create a dataset of the form:

| Name | Symbol | Projected |
|------|--------|-----------|

Where "Projected" is the projected price of the company's stock (2x, 1x, or 0.5x the quarter 4 price, as explained above). Note that you will need to join companies to prices and do some data wrangling operations.
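One possible shape for this workflow is sketched below on toy data: filter prices to quarter 4, merge on `Symbol`, then multiply the price by a factor looked up from the name-length category. Several things here are assumptions: the toy rows are made up (the "XYZ" company is hypothetical), quarters are assumed to be stored as the integers 1 through 4, and the medium range is taken to be 8 through 12 characters as in the function above. Check how `Quarter` is actually encoded in `prices.csv` before reusing the filter.

```python
import pandas as pd

# Toy stand-ins; the real data comes from companies.csv and prices.csv
companies = pd.DataFrame({
    'Symbol': ['MMM', 'ACN', 'XYZ'],
    'Name': ['3M Company', 'Accenture plc', 'Xyz Co'],
})
prices = pd.DataFrame({
    'Symbol': ['MMM', 'MMM', 'ACN', 'XYZ'],
    'Price': [90.0, 100.0, 150.0, 20.0],
    'Quarter': [3, 4, 4, 4],  # assumption: quarters are integers 1-4
})

def get_name_length(name):
    '''Determines if a company name is short, medium, or long.'''
    length = len(name)
    if length > 12:
        return 'long'
    elif length >= 8:
        return 'medium'
    return 'short'

# Price multiplier for each name-length category
multipliers = {'long': 2.0, 'medium': 1.0, 'short': 0.5}

latest = prices[prices['Quarter'] == 4]                      # keep quarter 4 only
merged = companies.merge(latest, on='Symbol', how='inner')   # join on the ticker symbol
factor = merged['Name'].apply(get_name_length).map(multipliers)
merged['Projected'] = merged['Price'] * factor

projected = merged[['Name', 'Symbol', 'Projected']]
print(projected)
```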

## Part 2

*Modeling*

1. Load in the data in `cars.csv` (in the data folder)

2. Explore the data:
  * How many rows are in the data?
  * What does each row represent?
  * How many columns are in the data?
  * What data type is each column?
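The exploration questions above map onto a few standard DataFrame attributes. The sketch below runs them on a tiny toy frame rather than the real `cars.csv`; apart from `MSRP`, the column names are hypothetical and the values are made up.

```python
import pandas as pd

# Toy stand-in for cars.csv; "Make" and "Horsepower" are hypothetical columns
cars = pd.DataFrame({
    'Make': ['Acme', 'Acme', 'Zenith'],
    'Horsepower': [150, 300, 210],
    'MSRP': [22000.0, 48000.0, 31000.0],
})

n_rows, n_cols = cars.shape   # (number of rows, number of columns)
print(f'{n_rows} rows, {n_cols} columns')
print(cars.dtypes)            # data type of each column
print(cars.head())            # look at a few rows to see what one row represents
```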

3. Set up your model specification to predict the MSRP of each car.
  * What column will be your target?
  * Which columns will be your features?

4. Prepare your data.
  * View the target variable's distribution. Remove outliers if necessary.
  * Engineer your features (encode categorical variables, etc.)
  * Split your data into train and test sets.
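The preparation steps above can be sketched as follows, again on toy data. The column names other than `MSRP` are hypothetical, and the 99th-percentile cutoff is only an illustrative choice; whether and where to trim outliers depends on what the real distribution looks like.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for cars.csv with one extreme MSRP value
cars = pd.DataFrame({
    'Make': ['Acme', 'Acme', 'Zenith', 'Zenith', 'Orbit', 'Orbit', 'Acme', 'Zenith'],
    'Horsepower': [150, 300, 210, 180, 400, 120, 250, 275],
    'MSRP': [22000.0, 48000.0, 31000.0, 27000.0, 950000.0, 18000.0, 39000.0, 42000.0],
})

# View the target's distribution, then drop rows above an illustrative cutoff
print(cars['MSRP'].describe())
cutoff = cars['MSRP'].quantile(0.99)
trimmed = cars[cars['MSRP'] <= cutoff]

# One-hot encode the categorical column; drop_first avoids a redundant dummy column
X = pd.get_dummies(trimmed.drop(columns='MSRP'), columns=['Make'], drop_first=True)
y = trimmed['MSRP']

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)
```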

5. Train a linear model using `sklearn.linear_model`'s `LinearRegression`.
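A minimal sketch of the fitting step is shown below. It uses synthetic data with a single hypothetical feature, `Horsepower`, generated so that the true slope is 100; with the real data you would pass in your own training features and target instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic training data: MSRP = 20000 + 100 * Horsepower + noise
rng = np.random.default_rng(0)
X_train = pd.DataFrame({'Horsepower': rng.uniform(100, 400, size=200)})
y_train = 20000 + 100 * X_train['Horsepower'] + rng.normal(0, 1000, size=200)

model = LinearRegression()
model.fit(X_train, y_train)

# The fitted intercept and one coefficient per feature
print(model.intercept_)
print(dict(zip(X_train.columns, model.coef_)))
```

The fitted coefficient should land close to the slope of 100 used to generate the data.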

6. Is this model good?
  * Is it better than any other estimate we have?
  * What's the train RMSE? What's the test RMSE? How do they compare to the standard deviation of the target?
  * Create a predicted vs. actual plot.
  * Which variables have the largest effect size?
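One way to approach the RMSE questions above is sketched below: compute RMSE on the train and test sets, then compare both against a naive baseline that always predicts the training mean. The data is synthetic and the single feature is hypothetical, so the numbers say nothing about the real cars data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: MSRP = 20000 + 100 * horsepower + noise
rng = np.random.default_rng(0)
horsepower = rng.uniform(100, 400, size=400)
msrp = 20000 + 100 * horsepower + rng.normal(0, 1000, size=400)
X = horsepower.reshape(-1, 1)  # scikit-learn expects a 2-D feature array

X_train, X_test, y_train, y_test = train_test_split(X, msrp, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)

def rmse(actual, predicted):
    '''Root mean squared error between two arrays.'''
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))

# Baseline: ignore the features and always predict the training mean
baseline_rmse = rmse(y_test, np.full_like(y_test, y_train.mean()))
target_std = float(np.std(y_train))

print(f'train RMSE:    {train_rmse:.0f}')
print(f'test RMSE:     {test_rmse:.0f}')
print(f'baseline RMSE: {baseline_rmse:.0f}')
print(f'target std:    {target_std:.0f}')
```

A test RMSE far above the train RMSE suggests overfitting, while an RMSE close to the target's standard deviation means the model is doing little better than guessing the mean.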

*Python Environments*

7. While training and running your model should have been pretty fast, imagine that you are working with orders of magnitude more data -- so you want to train the model overnight, rather than interactively.
  * Export your Jupyter notebook as a `.py` script.
  * Be sure to add `print()` calls so you can see important data.
  * Run your `.py` file from the command line, and verify the results.
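One way to export the notebook is `jupyter nbconvert --to script` followed by the notebook's filename. The skeleton below is a sketch of how the resulting `.py` file might be organized so that progress is visible when it runs unattended; the function name `main` and the placeholder metrics are illustrative, not part of the exercise.

```python
import time

def main():
    '''Run the whole workflow and print progress along the way.'''
    start = time.time()
    print('Starting training run...')

    # ... load the data, train the model, compute metrics (omitted in this sketch) ...
    metrics = {'train_rmse': None, 'test_rmse': None}  # placeholders, not real results

    print(f'Metrics: {metrics}')
    print(f'Finished in {time.time() - start:.1f} seconds')
    return metrics

# Only run the workflow when the file is executed as a script, not when imported
if __name__ == '__main__':
    main()
```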

8. Think back to Lesson 8, on the data science ecosystem. 
  * Which package sounded most interesting/useful to you? 
  * If you are working on a platform with `conda`, create a new environment called "temporary".
  * Activate the `temporary` environment.
  * Install your package of choice in `temporary`.
  * Try to import it in a notebook (remember you'll need to get your conda environment working in Jupyter), and look in the online documentation (just google "package-name docs") to figure out what you can do with this package.
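If the import fails in Jupyter, a common cause is that the notebook kernel is running a different Python than the one in your new environment. The sketch below, using only the standard library, reports which interpreter is active and whether a top-level package can be found. The helper name `describe_environment` is hypothetical, and `json` is only a stand-in for your package of choice.

```python
import importlib.util
import sys

def describe_environment(package_name):
    '''Report the active interpreter and whether a top-level package is importable.'''
    return {
        'python': sys.executable,   # path to the interpreter running this code
        'prefix': sys.prefix,       # root folder of the active environment
        'installed': importlib.util.find_spec(package_name) is not None,
    }

# Replace 'json' with the name of the package you installed
print(describe_environment('json'))
```

If the `prefix` path does not point into the `temporary` environment, the notebook is not using that environment's kernel.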