We want to read a data set from a CSV file. The units for each column are stored in a second CSV file, and the column names in a third.

We're going to use Pint to try to add units to our dataframe. Pint adds unit support to NumPy arrays. It also has an experimental extension, pint-pandas, that works with Pandas.

https://pint.readthedocs.io/en/0.11

https://pint.readthedocs.io/en/0.11/pint-pandas.html

Install pint-pandas with `python -m pip install --user git+https://github.com/hgrecco/pint-pandas.git`

In [8]:
import pandas as pd
import pint
import doctest

In [6]:
#!cat ../data/R_headerfile.csv
#!cat ../data/R_headerunitfile.csv
!cat ../data/R_dataframe.csv

Material 1, 70000.000000, 0.330000, 200.000000, 0.008300, 0.200000, 79.863437, 0.560000, 1, 1.000000, 12.000000, 50.000000, 69936.347436, 0.330000, 79.893583, 119.587222, A1, R_full, 5.5999999999997274180429940e-01, 5.5999999999999983124610026e-01, 5.6000161021271344097272049e-01, 5.5999999999999972022379779e-01, 5.6000059157474524340614153e-01, 5.6000000000000005329070518e-01, 5.5999999999999994226840272e-01, 5.6000121448417494729454802e-01, 1.5847555195116283521805656e-01
Material 1, 70000.000000, 0.330000, 200.000000, 0.008300, 0.200000, 79.863437, 0.560000, 1, 1.000000, 12.000000, 50.000000, 69936.347436, 0.330000, 79.893583, 119.587222, A1, R_1, 5.6000000000000194066984704e-01, 5.5999999999999983124610026e-01, 5.6000450601771223357872032e-01, 5.5999999999999972022379779e-01, 5.6000120556782184699784466e-01, 5.6000000000000005329070518e-01, 5.6000000000000005329070518e-01, 5.6000332819306641862766583e-01, 1.2413804670039763067279637e-01
Material 1, 70000.000000, 0.330000, 200.000

Material 3, 70000.000000, 0.330000, 550.000000, 0.003600, 0.250000, 149.870095, 0.750000, 6, 1.000000, 12.000000, 50.000000, 69888.229568, 0.330000, 149.976306, 303.417065, A3, R_full, 7.5000000000000066613381478e-01, 8.6216727981736041019900085e-01, 8.6221829073495492856693545e-01, 9.8091733336660613673529951e-01, 9.8095644395538883486551640e-01, 9.3766621974406816342195725e-01, 9.3766621974406816342195725e-01, 9.3771646581881407112035731e-01, 2.2594980818668350397437905e-01
Material 3, 70000.000000, 0.330000, 550.000000, 0.003600, 0.250000, 149.870095, 0.750000, 6, 1.000000, 12.000000, 50.000000, 69888.229568, 0.330000, 149.976306, 303.417065, A3, R_1, 5.3089828078534351263328972e-01, 6.0825296369039527633049147e-01, 6.0828704041042147565576670e-01, 7.0291083237863405397405359e-01, 7.0293713094886067782596228e-01, 6.6303014247058600361128811e-01, 6.6303014247058589258898564e-01, 6.6306631211085687027662061e-01, 2.0062868120007332217724638e-01
Material 3, 70000.000000, 0.330000, 550

Material 6, 70000.000000, 0.360000, 750.000000, 0.003400, 0.160000, 323.330443, 0.880000, 5, 1.000000, 12.000000, 50.000000, 69740.476952, 0.330000, 323.870504, 476.932811, E, R_4, 7.6908526256352327532539448e-01, 7.6740846726794598176013551e-01, 7.7036151407565134352495306e-01, 6.5348318408175920524172398e-01, 6.5280109467160640779326286e-01, 7.5173913043478257645091389e-01, 7.5173913043478257645091389e-01, 7.5420748918142954675403189e-01, 1.5674246471229250077250583e-01
Material 6, 70000.000000, 0.360000, 750.000000, 0.003400, 0.160000, 323.330443, 0.880000, 5, 1.000000, 12.000000, 50.000000, 69740.476952, 0.330000, 323.870504, 476.932811, F, R_full, 8.8004655362130745910320684e-01, 7.9535239709718641432090180e-01, 8.0001833245746634126760455e-01, 7.2844071248629038706212668e-01, 7.2800123135631800153078075e-01, 7.5173913043478257645091389e-01, 7.5173913043478257645091389e-01, 7.5445563127665293823298498e-01, 1.4093940315390379725002390e-01
Material 6, 70000.000000, 0.360000, 750.0

Material 9, 200000.000000, 0.300000, 1500.000000, 0.003000, 0.220000, 465.762370, 0.800000, 3, 1.000000, 12.000000, 50.000000, 199665.850674, 0.300000, 466.088762, 863.934887, F, R_2, 7.5009328358209026443859102e-01, 7.6013383531095435330371402e-01, 7.5857991045024353304881970e-01, 9.1777678078654811866243790e-01, 9.1462893766980057908000390e-01, 8.0000000000000004440892099e-01, 8.0000000000000015543122345e-01, 7.9809773935158889734964305e-01, 2.1234867870098339537321408e-01
Material 9, 200000.000000, 0.300000, 1500.000000, 0.003000, 0.220000, 465.762370, 0.800000, 3, 1.000000, 12.000000, 50.000000, 199665.850674, 0.300000, 466.088762, 863.934887, F, R_3, 8.5009328358208946507801329e-01, 8.5614347444857430424747236e-01, 8.5375719663579896501204303e-01, 1.1433115211319397896971850e+00, 1.1377843632476349888804634e+00, 9.0000000000000013322676296e-01, 9.0000000000000013322676296e-01, 8.9721261329048584975964786e-01, 2.1573311539770423372885944e-01
Material 9, 200000.000000, 0.300000, 1

One check that needs to be performed is to confirm that every row of the units dataframe is identical, and likewise for the column-names dataframe. I'm going to create a function called `same_rows` to do this. It has some test cases associated with it that I check with the `doctest` module.

In [9]:
def same_rows(df_):
    """Assert that all values are the same for each column
    
    Args:
      df_: Pandas dataframe
      
    Returns:
      None
    
    >>> df0 = pd.DataFrame(dict(a=[1, 1], b=[2, 2]))
    >>> same_rows(df0)
    
    >>> df1 = pd.DataFrame(dict(a=[1, 1], b=[2, 1]))
    >>> same_rows(df1)
    Traceback (most recent call last):
    ...
    AssertionError: Some rows are not all the same
    
    >>> df2 = pd.DataFrame(dict(a=[1, 2], b=[2, 1]))
    >>> same_rows(df2)
    Traceback (most recent call last):
    ...
    AssertionError: Some rows are not all the same
    
    """
    assert df_.apply(lambda x: len(x.unique()) == 1).all(), "Some rows are not all the same"
    
doctest.testmod()

TestResults(failed=0, attempted=6)
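When `same_rows` fails, the assertion only says that something differs. As a minimal sketch (the helper name `varying_columns` and the demo data are hypothetical, not part of the original notebook), the same per-column uniqueness test can be used to report which columns are inconsistent:

```python
import pandas as pd


def varying_columns(df_):
    """Return the labels of columns that hold more than one distinct value."""
    # Count distinct values per column; NaN is counted as a value here.
    counts = df_.nunique(dropna=False)
    return list(counts[counts > 1].index)


# Hypothetical demo data: column "b" differs between the two rows.
df_demo = pd.DataFrame(dict(a=[1, 1], b=[2, 1]))
print(varying_columns(df_demo))
```

An empty list means every row is identical, which is the condition `same_rows` asserts.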

Now we can read in the helper dataframes (not the actual data). Notice that I'm using a slightly different separator which allows a comma plus any amount of white space using a regex. Also these CSV files don't have a header line.

In [12]:
# Raw strings keep the backslash in the regex from being treated as a string escape.
df_header = pd.read_csv('../data/R_headerfile.csv', sep=r",\s+", header=None, engine="python")
df_units = pd.read_csv('../data/R_headerunitfile.csv', sep=r",\s+", header=None, engine="python")
    
same_rows(df_header)
same_rows(df_units)
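To see what the regex separator does without the real files, here is a small sketch that reads an in-memory string instead. The two header-like lines are invented for illustration and only mimic the "comma plus whitespace" layout described above:

```python
import io

import pandas as pd

# Made-up stand-in for a header file: two identical rows, with a comma
# followed by a varying amount of white space between the fields.
raw = "MatID,  E, nu\nMatID,  E, nu\n"

# A regex separator needs the Python parsing engine.
df_demo = pd.read_csv(io.StringIO(raw), sep=r",\s+", header=None, engine="python")
print(df_demo)
```

With `header=None` the columns are simply numbered 0, 1, 2, and every field comes back without leading white space.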

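Let's read in the data and use `df_header` and `df_units` to set the column names. Each column is labelled with a (name, unit) pair, with the placeholder `na` in the units file replaced by `1`. At this stage the units are only labels; the dataframe itself has no units associated with it.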

In [14]:
tuple(zip(df_header.iloc[0], df_units.loc[0, :].str.replace('na', '1')))

(('MatID', '1'),
 ('E', 'MPa'),
 ('nu', '1'),
 ('K', 'MPa'),
 ('epsilon_o', 'm/m'),
 ('n_value', '1'),
 ('s_x_yield', 'MPa'),
 ('r_value', '1'),
 ('r_function', '1'),
 ('to', 'mm'),
 ('wo', 'mm'),
 ('lo', 'mm'),
 ('mE', 'MPa'),
 ('mnu', '1'),
 ('Rp02_data', 'MPa'),
 ('UTS_data', 'MPa'),
 ('Method', '1'),
 ('Range', '1'),
 ('r_applied_avg_ref', '1'),
 ('r_ISO_intercept_zero_ref', '1'),
 ('r_ISO_intercept_zero_data', '1'),
 ('r_ISO_intercept_notzero_ref', '1'),
 ('r_ISO_intercept_notzero_data', '1'),
 ('r_applied_point_ref', '1'),
 ('r_single_point_ref', '1'),
 ('r_single_point_data', '1'),
 ('n_value_data', '1'))
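One thing to be aware of: `.str.replace('na', '1')` replaces the substring `na` wherever it occurs, not only in cells that are exactly `na`. That is harmless for the units listed above, but it would mangle a unit whose name contains `na`. A small sketch of the difference, using a made-up unit list (the `nanometer` entry does not appear in the real units file):

```python
import pandas as pd

# Hypothetical unit labels, chosen to include one that contains "na".
units = pd.Series(["na", "MPa", "nanometer"])

# Substring replacement, as used in the cell above.
by_substring = units.str.replace("na", "1")

# Whole-value replacement: only cells equal to "na" are changed.
by_value = units.replace("na", "1")

print(by_substring.tolist())
print(by_value.tolist())
```

`Series.replace` matches whole cell values, so it is the safer choice if the unit list ever grows.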

In [15]:
names = tuple(zip(df_header.iloc[0], df_units.loc[0, :].str.replace('na', '1')))
df = pd.read_csv('../data/R_dataframe.csv', names=names, header=None)
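Because each column label is a (name, unit) pair, the columns form a two-level index. The sketch below builds the same kind of structure explicitly with `pd.MultiIndex.from_arrays`; the three columns and two data rows are a made-up subset used only for illustration, and the level names `name` and `unit` are my own choice:

```python
import io

import pandas as pd

# Invented miniature version of the data file and its two helper rows.
names = ["E", "nu", "to"]
units = ["MPa", "1", "mm"]
raw = "70000.0, 0.33, 1.0\n200000.0, 0.30, 1.0\n"

df_demo = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

# Level 0 holds the column names, level 1 (the last level) holds the units.
df_demo.columns = pd.MultiIndex.from_arrays([names, units], names=["name", "unit"])

# Selecting on the first level alone returns a dataframe keyed by the unit.
print(df_demo["E"])
```

This is why selecting `'E'` returns a one-column dataframe headed `MPa` rather than a plain series.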

In [22]:
df.loc[:, 'E']

           MPa
0      70000.0
1      70000.0
2      70000.0
3      70000.0
4      70000.0
...        ...
3895  200000.0
3896  200000.0
3897  200000.0
3898  200000.0


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 27 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   (MatID, 1)                         3900 non-null   object 
 1   (E, MPa)                           3900 non-null   float64
 2   (nu, 1)                            3900 non-null   float64
 3   (K, MPa)                           3900 non-null   float64
 4   (epsilon_o, m/m)                   3900 non-null   float64
 5   (n_value, 1)                       3900 non-null   float64
 6   (s_x_yield, MPa)                   3900 non-null   float64
 7   (r_value, 1)                       3900 non-null   float64
 8   (r_function, 1)                    3900 non-null   int64  
 9   (to, mm)                           3900 non-null   float64
 10  (wo, mm)                           3900 non-null   float64
 11  (lo, mm)                           3900 non-null   float

Pandas doesn't natively handle units, so we're going to try to use the Pint package for that. Pint currently only seems to work with the float columns, so the non-float columns are dropped for now. This still needs to be figured out.

In [17]:
df_numeric = df.drop(df.columns[df.dtypes != float], axis=1)
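An equivalent way to keep only the float columns is `select_dtypes`, which also makes it easy to list what was left out. This is a sketch on a tiny invented dataframe, not the real data:

```python
import pandas as pd

# Made-up frame with one object, one float and one integer column.
df_demo = pd.DataFrame(
    {
        "MatID": ["Material 1", "Material 9"],
        "E": [70000.0, 200000.0],
        "r_function": [1, 3],
    }
)

# Keep only float64 columns, mirroring the `dtypes != float` filter above.
kept = df_demo.select_dtypes(include=["float64"])

# Record which columns were excluded so they can be handled later.
dropped = df_demo.columns.difference(kept.columns)

print(list(kept.columns))
print(list(dropped))
```

Note that integer columns such as `r_function` are dropped along with the text columns.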

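Pint black magic: `quantify(level=-1)` reads the units from the last level of the column index and converts each column to a Pint-backed dtype.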

In [19]:
df_pint = df_numeric.pint.quantify(level=-1)

MPa
1
m/m
mm


In [20]:
df_pint

Unnamed: 0,E,nu,K,epsilon_o,n_value,s_x_yield,r_value,to,wo,lo,...,UTS_data,r_applied_avg_ref,r_ISO_intercept_zero_ref,r_ISO_intercept_zero_data,r_ISO_intercept_notzero_ref,r_ISO_intercept_notzero_data,r_applied_point_ref,r_single_point_ref,r_single_point_data,n_value_data
0,70000.0,0.33,200.0,0.0083,0.2,79.863437,0.56,1.0,12.0,50.0,...,119.587222,0.5599999999999727,0.56,0.5600016102127136,0.5599999999999996,0.5600005915747454,0.56,0.56,0.5600012144841748,0.15847555195116286
1,70000.0,0.33,200.0,0.0083,0.2,79.863437,0.56,1.0,12.0,50.0,...,119.587222,0.560000000000002,0.56,0.5600045060177121,0.5599999999999996,0.5600012055678216,0.56,0.56,0.5600033281930664,0.1241380467003976
2,70000.0,0.33,200.0,0.0083,0.2,79.863437,0.56,1.0,12.0,50.0,...,119.587222,0.560000000000002,0.5599999999999996,0.5600024021802154,0.5599999999999996,0.5600006209787282,0.56,0.56,0.5600020047186465,0.17888312163978415
3,70000.0,0.33,200.0,0.0083,0.2,79.863437,0.56,1.0,12.0,50.0,...,119.587222,0.560000000000002,0.56,0.5600016865735418,0.56,0.5600004503951088,0.56,0.56,0.5600014942081785,0.18712188308993932
4,70000.0,0.33,200.0,0.0083,0.2,79.863437,0.56,1.0,12.0,50.0,...,119.587222,0.560000000000002,0.56,0.560001329473156,0.56,0.56000036297035,0.56,0.56,0.5600012144841748,0.19074499700990696
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3895,200000.0,0.3,1300.0,0.0005,0.14,559.3690190000001,0.8,1.0,12.0,50.0,...,857.3797310000001,0.8000000000000007,0.9189812341566038,0.8866350371503168,1.0518989932887088,1.0415817416548547,1.000321149951081,1.000321149951081,0.9732654145924464,0.15531367202680313
3896,200000.0,0.3,1300.0,0.0005,0.14,559.3690190000001,0.8,1.0,12.0,50.0,...,857.3797310000001,0.5667447622765223,0.64686757365775,0.6099146313472834,0.7633360339222487,0.7503295773351399,0.7073338684947345,0.7073338684947345,0.674196034093746,0.16937088733521444
3897,200000.0,0.3,1300.0,0.0005,0.14,559.3690190000001,0.8,1.0,12.0,50.0,...,857.3797310000001,0.7801942962349735,0.7930858325513875,0.7600278788962445,1.0097410462610976,1.0005521326210671,0.8411664690963376,0.8411664690963372,0.809891706252499,0.14901705671718876
3898,200000.0,0.3,1300.0,0.0005,0.14,559.3690190000001,0.8,1.0,12.0,50.0,...,857.3797310000001,0.888433376977257,0.8938347306443061,0.8636324662753228,1.1471022362314944,1.1392071390987941,0.9309037228070762,0.9309037228070762,0.901948892308952,0.14602869058588408


In [21]:
df_pint.E * df_pint.E

0        4900000000.0
1        4900000000.0
2        4900000000.0
3        4900000000.0
4        4900000000.0
            ...      
3895    40000000000.0
3896    40000000000.0
3897    40000000000.0
3898    40000000000.0
3899    40000000000.0
Name: E, Length: 3900, dtype: pint[megapascal ** 2]

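One calculation that Mark wanted to do is find the relative difference between the two elasticity columns, `mE` and `E`. Pint handles the units here, but `describe` seems to be broken when units are included, so the summary statistics further down are computed on the unitless dataframe `df` instead. The `MPa` header in that summary is just the column label carried over from `df`; the ratio itself is dimensionless.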

In [9]:
norm = (df_pint.mE - df_pint.E) / df_pint.E
norm

0       -0.0009093223428571946
1       -0.0009093223428571946
2       -0.0009093223428571946
3       -0.0009093223428571946
4       -0.0009093223428571946
                 ...          
3895    -0.0017932324900000822
3896    -0.0017932324900000822
3897    -0.0017932324900000822
3898    -0.0017932324900000822
3899    -0.0017932324900000822
Length: 3900, dtype: pint[dimensionless]

In [10]:
norm_ = (df.mE - df.E) / df.E
norm_.describe()

             MPa
count     3900.0
mean    -0.00225
std      0.00116
min    -0.003707
25%    -0.003578
50%    -0.001732
75%    -0.001582
max    -0.000549


In [11]:
df_pint.loc[:5, 'E':'K'].pint.to_base_units()

               E    nu            K
0  70000000000.0  0.33  200000000.0
1  70000000000.0  0.33  200000000.0
2  70000000000.0  0.33  200000000.0
3  70000000000.0  0.33  200000000.0
4  70000000000.0  0.33  200000000.0
5  70000000000.0  0.33  200000000.0


In [12]:
df_pint.loc[:5, 'E':'K']

         E    nu      K
0  70000.0  0.33  200.0
1  70000.0  0.33  200.0
2  70000.0  0.33  200.0
3  70000.0  0.33  200.0
4  70000.0  0.33  200.0
5  70000.0  0.33  200.0
